Preface


Machine learning and predictive analytics are becoming key strategies for unlocking
growth in a challenging contemporary marketplace. Machine learning is one of the fastest-growing
trends in modern computing, and everyone wants to get into the field. To gain
recognition in this field, one must be able to understand and design a machine learning system that
serves the needs of a project. This Learning Path is designed to help you tackle the
real-world complexities of modern machine learning with innovative and cutting-edge techniques.
It will also give you a solid foundation in the machine learning design process and enable you to
build customized machine learning models to solve unique problems.
What this learning path covers


Module 1, Python Machine Learning, discusses the essential machine learning algorithms for
classification and provides practical examples using scikit-learn. It teaches you to prepare variables
of different types and also covers polynomial regression and tree-based approaches. This module
also focuses on an open source Python library that allows us to utilize multiple cores of modern GPUs.


Module 2, Designing Machine Learning Systems with Python, acquaints you with the large library of
packages available for machine learning tasks. It introduces broad topics such as big data, data
properties, data sources, and data processing. You will further explore models that form the
foundation of many advanced nonlinear techniques. This module will help you understand model
selection and parameter tuning techniques that can be applied in various case studies.


Module 3, Advanced Machine Learning with Python, helps you to build your skills with deep
architectures by using stacked denoising autoencoders. This module is a blend of semi-supervised
learning techniques and RBM and DBN algorithms. It further focuses on tools and techniques that
will help in establishing a consistent working process.
What you need for this learning path


Module 1, Python Machine Learning, requires an installation of Python 3.4.3 or newer on Mac OS
X, Linux, or Microsoft Windows. The essential Python libraries SciPy, NumPy, scikit-learn,
matplotlib, and pandas are also required.


Before you start, please refer to the following:


• The direct link to the Iris dataset is
https://raw.githubusercontent.com/rasbt/python-machine-learning-book/master/code/datasets/iris/iris.data

• We've added some additional notes to the code notebooks mentioning the offline datasets in case
there are server errors:
https://www.dropbox.com/sh/tq2qdhO0ogfgsktgq/AADIt7esnb1WLOQODnS5q_7Dta?dl=0

• Module 2, Designing Machine Learning Systems with Python, requires an inclination to learn
machine learning and an installation of Python 3, which you can download from
https://www.python.org/downloads/.

• Module 3, Advanced Machine Learning with Python, leverages openly available data and code,
including open source Python libraries and frameworks.
Who this learning path is for


This title is for data scientists and researchers who already work in the field of data science and want
to see machine learning in action and explore its real-world applications. Prior knowledge of Python
programming and mathematics is a must, along with basic knowledge of machine learning concepts.
Reader feedback


Feedback from our readers is always welcome. Let us know what you think about this course—what 
you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will
really get the most out of.


To send us general feedback, simply e-mail <feedback@packtpub.com>, and mention the course's
title in the subject of your message.


If there is a topic that you have expertise in and you are interested in either writing or contributing to
a book, see our author guide at www.packtpub.com/authors.
Customer support


Now that you are the proud owner of a Packt course, we have a number of things to help you to get the 
most from your purchase. 
Downloading the example code


You can download the example code files for this course from your account at 
http://www.packtpub.com. If you purchased this course elsewhere, you can visit 
http://www.packtpub.com/support and register to have the files e-mailed directly to you. 


You can download the code files by following these steps: 


1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the course in the Search box.
5. Select the course for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this course from.
7. Click on Code Download.


You can also download the code files by clicking on the Code Files button on the course's webpage at 
the Packt Publishing website. This page can be accessed by entering the course's name in the Search 
box. Please note that you need to be logged in to your Packt account. 


Once the file is downloaded, please make sure that you unzip or extract the folder using the latest
version of:


• WinRAR / 7-Zip for Windows
• Zipeg / iZip / UnRarX for Mac
• 7-Zip / PeaZip for Linux


The code bundle for the course is also hosted on GitHub at 
Errata


Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you 
find a mistake in one of our courses—maybe a mistake in the text or the code—we would be grateful 
if you could report this to us. By doing so, you can save other readers from frustration and help us 
improve subsequent versions of this course. If you find any errata, please report them by visiting 
http://www.packtpub.com/submit-errata, selecting your course, clicking on the Errata Submission 
Form link, and entering the details of your errata. Once your errata are verified, your submission will 
be accepted and the errata will be uploaded to our website or added to any list of existing errata 
under the Errata section of that title. 


To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and
enter the name of the course in the search field. The required information will appear under the
Errata section.
Piracy


Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we 
take the protection of our copyright and licenses very seriously. If you come across any illegal copies 
of our works in any form on the Internet, please provide us with the location address or website name 
immediately so that we can pursue a remedy. 


Please contact us at <copyright@packtpub.com> with a link to the suspected pirated material. 


We appreciate your help in protecting our authors and our ability to bring you valuable content. 
Questions


If you have a problem with any aspect of this course, you can contact us at 
<questions@packtpub.com>, and we will do our best to address the problem. 
Part 1. Module 1


Python Machine Learning


Leverage benefits of machine learning techniques using Python
Chapter 1. Giving Computers the Ability to Learn from Data


In my opinion, machine learning, the application and science of algorithms that make sense of data,
is the most exciting field of all the computer sciences! We are living in an age where data comes in 
abundance; using the self-learning algorithms from the field of machine learning, we can turn this data 
into knowledge. Thanks to the many powerful open source libraries that have been developed in 
recent years, there has probably never been a better time to break into the machine learning field and 
learn how to utilize powerful algorithms to spot patterns in data and make predictions about future 
events. 


In this chapter, we will learn about the main concepts and different types of machine learning. 
Together with a basic introduction to the relevant terminology, we will lay the groundwork for 
successfully using machine learning techniques for practical problem solving. 


In this chapter, we will cover the following topics:


• The general concepts of machine learning
• The three types of learning and basic terminology
• The building blocks for successfully designing machine learning systems
• Installing and setting up Python for data analysis and machine learning
Building intelligent machines to transform data into knowledge


In this age of modern technology, there is one resource that we have in abundance: a large amount of
structured and unstructured data. In the second half of the twentieth century, machine learning evolved
as a subfield of artificial intelligence that involved the development of self-learning algorithms to
gain knowledge from that data in order to make predictions. Instead of requiring humans to manually
derive rules and build models from analyzing large amounts of data, machine learning offers a more
efficient alternative for capturing the knowledge in data to gradually improve the performance of
predictive models, and make data-driven decisions. Not only is machine learning becoming
increasingly important in computer science research but it also plays an ever greater role in our
everyday life. Thanks to machine learning, we enjoy robust e-mail spam filters, convenient text and
voice recognition software, reliable Web search engines, challenging chess players, and, hopefully
soon, safe and efficient self-driving cars.
The three different types of machine learning


In this section, we will take a look at the three types of machine learning: supervised learning, 
unsupervised learning, and reinforcement learning. We will learn about the fundamental differences 
between the three different learning types and, using conceptual examples, we will develop an 
intuition for the practical problem domains where these can be applied: 


[Figure: the three types of machine learning: supervised learning, unsupervised learning, and
reinforcement learning]
Making predictions about the future with supervised learning


The main goal in supervised learning is to learn a model from labeled training data that allows us to
make predictions about unseen or future data. Here, the term supervised refers to a set of samples
where the desired output signals (labels) are already known.


Considering the example of e-mail spam filtering, we can train a model using a supervised machine
learning algorithm on a corpus of labeled e-mails, that is, e-mails that are correctly marked as spam
or not-spam, to predict whether a new e-mail belongs to either of the two categories. A supervised
learning task with discrete class labels, such as in the previous e-mail spam-filtering example, is also
called a classification task. Another subcategory of supervised learning is regression, where the
outcome signal is a continuous value:


[Figure: labeled training data is passed to a machine learning algorithm, which produces a
predictive model; the predictive model then maps new data to a prediction]


Classification for predicting class labels 


Classification is a subcategory of supervised learning where the goal is to predict the categorical
class labels of new instances based on past observations. Those class labels are discrete, unordered
values that can be understood as the group memberships of the instances. The previously mentioned
example of e-mail-spam detection represents a typical example of a binary classification task, where
the machine learning algorithm learns a set of rules in order to distinguish between two possible
classes: spam and non-spam e-mail.


However, the set of class labels does not have to be of a binary nature. The predictive model learned
by a supervised learning algorithm can assign any class label that was presented in the training
dataset to a new, unlabeled instance. A typical example of a multi-class classification task is
handwritten character recognition. Here, we could collect a training dataset that consists of multiple
handwritten examples of each letter in the alphabet. Now, if a user provides a new handwritten
character via an input device, our predictive model will be able to predict the correct letter in the
alphabet with certain accuracy. However, our machine learning system would be unable to correctly
recognize any of the digits zero to nine, for example, if they were not part of our training dataset.


The following figure illustrates the concept of a binary classification task given 30 training samples:
15 training samples are labeled as negative class (circles) and 15 training samples are labeled as
positive class (plus signs). In this scenario, our dataset is two-dimensional, which means that each
sample has two values associated with it: $x_1$ and $x_2$. Now, we can use a supervised machine
learning algorithm to learn a rule (the decision boundary, represented as a black dashed line) that can
separate those two classes and classify new data into each of those two categories given its $x_1$ and
$x_2$ values:
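As a rough companion to the description above, the following sketch trains a very simple classifier on a two-dimensional toy dataset. The data values and the helper names `fit_centroids` and `predict` are made up for illustration, and the nearest-centroid rule is only one of many ways to obtain a linear decision boundary; it is not the algorithm used in later chapters.

```python
import numpy as np

# Toy 2D training set (made-up values): rows are samples, columns are x1 and x2.
X_train = np.array([[1.0, 1.2], [1.5, 0.8], [0.8, 1.9],   # negative class (0)
                    [4.0, 4.2], [4.5, 3.8], [3.8, 4.9]])  # positive class (1)
y_train = np.array([0, 0, 0, 1, 1, 1])


def fit_centroids(X, y):
    """Return the mean vector (centroid) of each class."""
    return {int(label): X[y == label].mean(axis=0) for label in np.unique(y)}


def predict(centroids, X):
    """Assign each sample to the class whose centroid is closest."""
    labels = np.array(sorted(centroids))
    dists = np.stack(
        [np.linalg.norm(X - centroids[label], axis=1) for label in labels], axis=1
    )
    return labels[dists.argmin(axis=1)]


centroids = fit_centroids(X_train, y_train)
X_new = np.array([[1.0, 1.0], [4.2, 4.0]])  # two new, unlabeled samples
y_new = predict(centroids, X_new)
print(y_new)
```

The implied decision boundary is the line of points that are equally far from both centroids, so new samples are classified by the side of that line on which they fall.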





Regression for predicting continuous outcomes 


We learned in the previous section that the task of classification is to assign categorical, unordered
labels to instances. A second type of supervised learning is the prediction of continuous outcomes,
which is also called regression analysis. In regression analysis, we are given a number of predictor
(explanatory) variables and a continuous response variable (outcome), and we try to find a
relationship between those variables that allows us to predict an outcome.


For example, let's assume that we are interested in predicting the Math SAT scores of our students. If
there is a relationship between the time spent studying for the test and the final scores, we could use it
as training data to learn a model that uses the study time to predict the test scores of future students
who are planning to take this test.
Note


The term regression was devised by Francis Galton in his article Regression Towards Mediocrity in
Hereditary Stature in 1886. Galton described the biological phenomenon that the variance of height
in a population does not increase over time. He observed that the height of parents is not passed on to
their children but the children's height is regressing towards the population mean.


The following figure illustrates the concept of linear regression. Given a predictor variable x and a
response variable y, we fit a straight line to this data that minimizes the distance—most commonly the 
average squared distance—between the sample points and the fitted line. We can now use the 
intercept and slope learned from this data to predict the outcome variable of new data: 
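To make the idea concrete, here is a minimal sketch that fits a straight line by least squares with NumPy's `polyfit`. The study hours and scores are invented for illustration and chosen to lie exactly on a line, so the fitted slope and intercept are easy to check by hand.

```python
import numpy as np

# Hypothetical data: hours spent studying (x) and test scores (y).
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = np.array([450.0, 500.0, 550.0, 600.0, 650.0])

# A degree-1 polynomial fit minimizes the squared vertical distance
# between the sample points and the line y = slope * x + intercept.
slope, intercept = np.polyfit(hours, scores, deg=1)

# Use the learned parameters to predict the outcome for new data.
predicted = slope * 6.0 + intercept
print(slope, intercept, predicted)
```

With real, noisy data the points would not fall exactly on the line, and the fit would instead balance the squared errors across all samples.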
Solving interactive problems with reinforcement learning


Another type of machine learning is reinforcement learning. In reinforcement learning, the goal is to
develop a system (agent) that improves its performance based on interactions with the environment.
Since the information about the current state of the environment typically also includes a so-called
reward signal, we can think of reinforcement learning as a field related to supervised learning.
However, in reinforcement learning this feedback is not the correct ground truth label or value, but a
measure of how good the action was, as quantified by a reward function. Through the interaction with
the environment, an agent can then use reinforcement learning to learn a series of actions that
maximizes this reward via an exploratory trial-and-error approach or deliberative planning.


A popular example of reinforcement learning is a chess engine. Here, the agent decides upon a series
of moves depending on the state of the board (the environment), and the reward can be defined as win
or lose at the end of the game:


[Figure: interaction between the agent and the environment]
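A chess engine is far too large for a short listing, so the sketch below uses a much simpler stand-in to show the trial-and-error idea: an agent repeatedly picks one of three actions, observes a reward, and updates its estimate of each action's value. The reward values, the `run_bandit` helper, and the epsilon-greedy strategy are assumptions chosen for illustration and do not come from the text.

```python
import random


def run_bandit(rewards, steps=1000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent for a toy environment with fixed rewards per action."""
    rng = random.Random(seed)
    estimates = [0.0] * len(rewards)  # current value estimate of each action
    counts = [0] * len(rewards)       # how often each action has been tried
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            action = rng.randrange(len(rewards))
        else:
            # Exploit: pick the action that currently looks best.
            action = max(range(len(rewards)), key=lambda a: estimates[a])
        reward = rewards[action]  # feedback from the environment
        counts[action] += 1
        # Incremental update of the running mean reward for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


estimates, counts = run_bandit([0.1, 0.5, 0.9])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print(best_action, estimates)
```

The agent is never told which action is correct; it only sees rewards, which is the key difference from the labeled feedback used in supervised learning.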
Discovering hidden structures with unsupervised learning


In supervised learning, we know the right answer beforehand when we train our model, and in
reinforcement learning, we define a measure of reward for particular actions by the agent. In
unsupervised learning, however, we are dealing with unlabeled data or data of unknown structure.
Using unsupervised learning techniques, we are able to explore the structure of our data to extract
meaningful information without the guidance of a known outcome variable or reward function.


Finding subgroups with clustering 


Clustering is an exploratory data analysis technique that allows us to organize a pile of information
into meaningful subgroups (clusters) without having any prior knowledge of their group
memberships. Each cluster that may arise during the analysis defines a group of objects that share a
certain degree of similarity but are more dissimilar to objects in other clusters, which is why
clustering is also sometimes called "unsupervised classification." Clustering is a great technique for
structuring information and deriving meaningful relationships among data. For example, it allows
marketers to discover customer groups based on their interests in order to develop distinct marketing
programs.


The figure below illustrates how clustering can be applied to organizing unlabeled data into three
distinct groups based on the similarity of their features $x_1$ and $x_2$:
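The grouping described above can be sketched with a bare-bones k-means loop in NumPy. This is only an illustrative implementation under simplifying assumptions: the toy points are invented, the `kmeans` helper is hypothetical, the starting centroids are picked with a deterministic farthest-point rule, and empty clusters are not handled.

```python
import numpy as np


def kmeans(X, k, n_iter=10):
    """Minimal k-means: returns a cluster label per sample and the centroids."""
    # Deterministic farthest-point initialization (an assumption for this
    # sketch): start at the first sample, then repeatedly add the sample
    # farthest from all centroids chosen so far.
    centroids = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[d.argmax()])
    centroids = np.array(centroids, dtype=float)

    for _ in range(n_iter):
        # Assignment step: each sample joins its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its members
        # (assumes that no cluster ends up empty).
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids


# Three tight, well-separated groups of unlabeled 2D points (made-up values).
offsets = np.array([[-0.2, 0.0], [0.2, 0.0], [0.0, -0.2], [0.0, 0.2]])
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + offsets for c in centers])

labels, centroids = kmeans(X, k=3)
print(labels)
```

No class labels are passed to `kmeans`; the groups emerge purely from the distances between the feature vectors.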
Dimensionality reduction for data compression


Another subfield of unsupervised learning is dimensionality reduction. Often we are working with 
data of high dimensionality—each observation comes with a high number of measurements—that can 
present a challenge for limited storage space and the computational performance of machine learning 
algorithms. Unsupervised dimensionality reduction is a commonly used approach in feature
preprocessing to remove noise from data, which can also degrade the predictive performance of 
certain algorithms, and compress the data onto a smaller dimensional subspace while retaining most 
of the relevant information. 


Sometimes, dimensionality reduction can also be useful for visualizing data—for example, a high- 
dimensional feature set can be projected onto one-, two-, or three-dimensional feature spaces in order 
to visualize it via 3D- or 2D-scatterplots or histograms. The figure below shows an example where 
non-linear dimensionality reduction was applied to compress a 3D Swiss Roll onto a new 2D feature 
subspace: 
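The Swiss Roll example relies on a non-linear technique, which is beyond a short listing. As a simpler stand-in, the sketch below uses linear principal component analysis computed through NumPy's singular value decomposition to compress 3D data onto a 2D subspace. The synthetic data and the `pca_project` helper are assumptions made for illustration.

```python
import numpy as np


def pca_project(X, n_components):
    """Project X onto its top principal directions (linear PCA via SVD)."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Fraction of the total variance kept by the selected directions.
    explained = (S[:n_components] ** 2).sum() / (S ** 2).sum()
    return X_centered @ components.T, explained


# Synthetic 3D data that actually lies in a 2D plane: the third feature
# is the sum of the first two, so one dimension is redundant.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 2))
X = np.column_stack([latent[:, 0], latent[:, 1], latent[:, 0] + latent[:, 1]])

X_2d, explained = pca_project(X, n_components=2)
print(X_2d.shape, explained)
```

Because the third feature carries no new information here, two components retain essentially all of the variance; real data usually loses some information in the projection.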
An introduction to the basic terminology and notations


Now that we have discussed the three broad categories of machine learning—supervised, 
unsupervised, and reinforcement learning—let us have a look at the basic terminology that we will be 
using in the next chapters. The following table depicts an excerpt of the Iris dataset, which is a
classic example in the field of machine learning. The Iris dataset contains the measurements of 150
iris flowers from three different species: Setosa, Versicolor, and Virginica. Here, each flower
sample represents one row in our data set, and the flower measurements
in centimeters are stored as columns, which we also call the features of the dataset:


[Figure: excerpt of the Iris dataset. Each row is a sample (instance, observation). The columns hold
the features (attributes, measurements, dimensions) sepal length, sepal width, petal length, and petal
width, followed by the class label (target).]





To keep the notation and implementation simple yet efficient, we will make use of some of the basics
of linear algebra. In the following chapters, we will use a matrix and vector notation to refer to our
data. We will follow the common convention to represent each sample as a separate row in a feature
matrix $\mathbf{X}$, where each feature is stored as a separate column.


The Iris dataset, consisting of 150 samples and 4 features, can then be written as a $150 \times 4$
matrix $\mathbf{X} \in \mathbb{R}^{150 \times 4}$:

$$
\mathbf{X} =
\begin{bmatrix}
x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & x_4^{(1)} \\
x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & x_4^{(2)} \\
\vdots & \vdots & \vdots & \vdots \\
x_1^{(150)} & x_2^{(150)} & x_3^{(150)} & x_4^{(150)}
\end{bmatrix}
$$

Note


For the rest of this book, we will use the superscript (i) to refer to the ith training sample, and the
subscript j to refer to the jth dimension of the training dataset.


We use lower-case, bold-face letters to refer to vectors ($\mathbf{x} \in \mathbb{R}^{n \times 1}$)
and upper-case, bold-face letters to refer to matrices ($\mathbf{X} \in \mathbb{R}^{n \times m}$),
respectively. To refer to single elements in a vector or matrix, we write the letters in italics
($x^{(n)}$ or $x_m^{(n)}$, respectively).


For example, $x_1^{(150)}$ refers to the first dimension of flower sample 150, the sepal length. Thus,
each row in this feature matrix represents one flower instance and can be written as a four-dimensional
row vector $\mathbf{x}^{(i)} \in \mathbb{R}^{1 \times 4}$:

$$
\mathbf{x}^{(i)} = \begin{bmatrix} x_1^{(i)} & x_2^{(i)} & x_3^{(i)} & x_4^{(i)} \end{bmatrix}
$$

Each feature dimension is a 150-dimensional column vector
$\mathbf{x}_j \in \mathbb{R}^{150 \times 1}$, for example:

$$
\mathbf{x}_j = \begin{bmatrix} x_j^{(1)} \\ x_j^{(2)} \\ \vdots \\ x_j^{(150)} \end{bmatrix}
$$

Similarly, we store the target variables (here: class labels) as a 150-dimensional column vector:

$$
\mathbf{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(150)} \end{bmatrix}
\quad \left( y \in \{\text{Setosa}, \text{Versicolor}, \text{Virginica}\} \right)
$$
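The notation above maps directly onto NumPy indexing. In the sketch below the feature matrix is filled with placeholder numbers rather than the real Iris measurements, and the label ordering is likewise an assumption, so only the shapes and index positions are meaningful. Keep in mind that NumPy counts from zero, so sample 150 sits at row index 149.

```python
import numpy as np

# Placeholder feature matrix with the same shape as the Iris data:
# 150 samples (rows) and 4 features (columns). Values are not real measurements.
X = np.arange(150 * 4, dtype=float).reshape(150, 4)

# Placeholder target vector with one class label per sample.
y = np.array(["Setosa"] * 50 + ["Versicolor"] * 50 + ["Virginica"] * 50)

x_sample_150 = X[149]      # x^(150): all four features of sample 150
x_feature_1 = X[:, 0]      # x_1: the first feature across all 150 samples
x_150_1 = X[149, 0]        # x_1^(150): first feature of sample 150

print(X.shape, x_sample_150.shape, x_feature_1.shape, x_150_1)
```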
A roadmap for building machine learning systems


In the previous sections, we discussed the basic concepts of machine learning and the three different
types of learning. In this section, we will discuss other important parts of a machine learning system
accompanying the learning algorithm. The diagram below shows a typical workflow for
using machine learning in predictive modeling, which we will discuss in the following subsections:


[Figure: predictive modeling workflow with four stages: Preprocessing, Learning, Evaluation, and
Prediction. Preprocessing covers feature extraction and scaling, feature selection, dimensionality
reduction, and sampling. Learning covers model selection, cross-validation, performance metrics, and
hyperparameter optimization.]
Preprocessing - getting data into shape


Raw data rarely comes in the form and shape that is necessary for the optimal performance of a
learning algorithm. Thus, the preprocessing of the data is one of the most crucial steps in any machine
learning application. If we take the Iris flower dataset from the previous section as an example, we
could think of the raw data as a series of flower images from which we want to extract meaningful
features. Useful features could be the color, the hue, the intensity of the flowers, the height, and the
flower lengths and widths. Many machine learning algorithms also require that the selected features
are on the same scale for optimal performance, which is often achieved by transforming the features
into the range [0, 1] or to a standard normal distribution with zero mean and unit variance, as we will
see in the later chapters.
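Both scaling approaches mentioned above take only a line of NumPy each. The sketch below uses a tiny made-up feature matrix whose two columns live on very different scales; the variable names are illustrative.

```python
import numpy as np

# Made-up data: two features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Min-max scaling: map each feature (column) into the range [0, 1].
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: shift each feature to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)
print(X_std)
```

After either transformation, both columns contribute on a comparable scale, which matters for algorithms that rely on distances or gradient-based optimization.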


Some of the selected features may be highly correlated and therefore redundant to a certain degree. In 
those cases, dimensionality reduction techniques are useful for compressing the features onto a lower 
dimensional subspace. Reducing the dimensionality of our feature space has the advantage that less 
storage space is required, and the learning algorithm can run much faster. 


To determine whether our machine learning algorithm not only performs well on the training set but 
also generalizes well to new data, we also want to randomly divide the dataset into a separate 
training and test set. We use the training set to train and optimize our machine learning model, while 
we keep the test set until the very end to evaluate the final model. 
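A random split of this kind can be sketched by shuffling the sample indices and cutting the shuffled list in two. The helper name `train_test_split_indices`, the 30 percent test fraction, and the fixed seed are assumptions for illustration.

```python
import numpy as np


def train_test_split_indices(n_samples, test_fraction=0.3, seed=0):
    """Return shuffled, non-overlapping index arrays for training and test sets."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)  # random order of 0 .. n_samples - 1
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


# Split the indices of a 150-sample dataset such as Iris.
train_idx, test_idx = train_test_split_indices(150)
print(len(train_idx), len(test_idx))
```

The returned arrays can be used to index both the feature matrix and the label vector, for example `X[train_idx]` and `y[train_idx]`, so that samples and labels stay aligned.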
Training and selecting a predictive model


As we will see in later chapters, many different machine learning algorithms have been developed to
solve different problem tasks. An important point that can be summarized from David Wolpert's
famous No Free Lunch Theorems is that we can't get learning "for free" (The Lack of A Priori
Distinctions Between Learning Algorithms, D.H. Wolpert 1996; No Free Lunch Theorems for
Optimization, D.H. Wolpert and W.G. Macready, 1997). Intuitively, we can relate this concept to the
popular saying, "I suppose it is tempting, if the only tool you have is a hammer, to treat everything
as if it were a nail" (Abraham Maslow, 1966). For example, each classification algorithm has its
inherent biases, and no single classification model enjoys superiority if we don't make any
assumptions about the task. In practice, it is therefore essential to compare at least a handful of
different algorithms in order to train and select the best performing model. But before we can
compare different models, we first have to decide upon a metric to measure performance. One
commonly used metric is classification accuracy, which is defined as the proportion of correctly
classified instances.
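Classification accuracy as defined above is simply the fraction of predictions that match the true labels. The labels in this small sketch are invented, and `accuracy` is a hypothetical helper name.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Proportion of samples whose predicted label equals the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


# Made-up labels: four of the five predictions are correct.
acc = accuracy([1, 0, 1, 1, 0], [1, 0, 0, 1, 0])
print(acc)
```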


One legitimate question to ask is: how do we know which model performs well on the final test
dataset and real-world data if we don't use this test set for the model selection but keep it for the 
final model evaluation? In order to address the issue embedded in this question, different cross- 
validation techniques can be used where the training dataset is further divided into training and 
validation subsets in order to estimate the generalization performance of the model. Finally, we 
also cannot expect that the default parameters of the different learning algorithms provided by 
software libraries are optimal for our specific problem task. Therefore, we will make frequent use of 
hyperparameter optimization techniques that help us to fine-tune the performance of our model in 
later chapters. Intuitively, we can think of those hyperparameters as parameters that are not learned 
from the data but represent the knobs of a model that we can turn to improve its performance, which 
will become much clearer in later chapters when we see actual examples. 
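One common way to divide the training data further is k-fold cross-validation, where each fold serves once as the validation subset. The generator below is a minimal sketch of that splitting step only; the name `k_fold_indices`, the fold count, and the seed are assumptions for illustration.

```python
import numpy as np


def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, val_idx) pairs, one per fold."""
    rng = np.random.default_rng(seed)
    # Shuffle the sample indices, then cut them into n_folds chunks.
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train_idx, val_idx


splits = list(k_fold_indices(10, n_folds=5))
for train_idx, val_idx in splits:
    print("train:", train_idx, "validation:", val_idx)
```

To compare hyperparameter settings, one would fit the model on each training part, score it on the matching validation part, and average the scores per setting; the test set stays untouched throughout.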
Evaluating models and predicting unseen data instances


After we have selected a model that has been fitted on the training dataset, we can use the test dataset
to estimate how well it performs on unseen data, that is, to estimate the generalization error. If we are
satisfied with its performance, we can now use this model to predict new, future data. It is important
to note that the parameters for the previously mentioned procedures—such as feature scaling and 
dimensionality reduction—are solely obtained from the training dataset, and the same parameters are 
later re-applied to transform the test dataset, as well as any new data samples—the performance 
measured on the test data may be overoptimistic otherwise. 
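The point about reusing parameters can be illustrated with standardization: the mean and standard deviation are computed from the training data only and then applied unchanged to the test data. The numbers below are made up for illustration.

```python
import numpy as np

# Made-up single-feature data.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

# Estimate the scaling parameters from the training set only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# Apply the same parameters to both the training and the test set.
X_train_std = (X_train - mu) / sigma
X_test_std = (X_test - mu) / sigma

print(X_train_std.ravel(), X_test_std.ravel())
```

Recomputing `mu` and `sigma` on the test set would leak information about the test data into the preprocessing and would make the measured performance look better than it really is.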
Using Python for machine learning


Python is one of the most popular programming languages for data science and therefore enjoys a
large number of useful add-on libraries developed by its great community.


Although the performance of interpreted languages, such as Python, for computation-intensive tasks is
inferior to lower-level programming languages, extension libraries such as NumPy and SciPy have
been developed that build upon lower layer Fortran and C implementations for fast and vectorized
operations on multidimensional arrays.
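The difference between an interpreted loop and a vectorized operation can be seen in a small sketch that computes a sum of squares both ways. The array size is arbitrary, and no timings are claimed here since they depend on the machine.

```python
import numpy as np

values = np.arange(100_000, dtype=np.float64)

# Pure Python loop: every multiplication and addition goes through the interpreter.
loop_total = 0.0
for v in values:
    loop_total += v * v

# Vectorized version: the whole computation runs inside NumPy's compiled code.
vec_total = float(np.dot(values, values))

print(loop_total, vec_total)
```

Both versions compute the same quantity up to floating-point rounding, but the vectorized call is typically much faster on large arrays and is also shorter to write.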


For machine learning programming tasks, we will mostly refer to the scikit-learn library, which is 
one of the most popular and accessible open source machine learning libraries as of today. 
Installing Python packages


Python is available for all three major operating systems (Microsoft Windows, Mac OS X, and
Linux), and the installer, as well as the documentation, can be downloaded from the official Python
website: https://www.python.org.


This book is written for Python version >= 3.4.3, and it is recommended you use the most recent 
version of Python 3 that is currently available, although most of the code examples may also be 
compatible with Python >= 2.7.10. If you decide to use Python 2.7 to execute the code examples, 
please make sure that you know about the major differences between the two Python versions. A good 
summary about the differences between Python 3.4 and 2.7 can be found at
https://wiki.python.org/moin/Python2orPython3. 


The additional packages that we will be using throughout this book can be installed via the pip
installer program, which has been bundled with the Python installers since Python 3.4. More
information about pip can be found at https://docs.python.org/3/installing/index.html.


After we have successfully installed Python, we can execute pip from the command line terminal to 
install additional Python packages: 


pip install SomePackage 


Already installed packages can be updated via the --upgrade flag: 


pip install SomePackage --upgrade 


A highly recommended alternative Python distribution for scientific computing is Anaconda by
Continuum Analytics. Anaconda is a free (including for commercial use), enterprise-ready Python
distribution that bundles all the essential Python packages for data science, math, and engineering in
one user-friendly cross-platform distribution. The Anaconda installer can be downloaded at
http://continuum.io/downloads#py34, and an Anaconda quick start-guide is available at
https://store.continuum.io/static/img/Anaconda-Quickstart.pdf.


After successfully installing Anaconda, we can install new Python packages using the following 
command: 


conda install SomePackage 


Existing packages can be updated using the following command: 


conda update SomePackage 


Throughout this book, we will mainly use NumPy's multi-dimensional arrays to store and manipulate
data. Occasionally, we will make use of pandas, which is a library built on top of NumPy that
provides additional higher level data manipulation tools that make working with tabular data even
more convenient. To augment our learning experience and visualize quantitative data, which is often
extremely useful to intuitively make sense of it, we will use the very customizable matplotlib library.


The version numbers of the major Python packages that were used for writing this book are listed 
below. Please make sure that the version numbers of your installed packages are equal to, or greater 
than, those version numbers to ensure the code examples run correctly: 


NumPy 1.9.1 
SciPy 0.14.0 
scikit-learn 0.15.2 
matplotlib 1.4.0 
pandas 0.15.2 
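One way to check which versions are installed is to query the package metadata from Python. This sketch assumes Python 3.8 or newer, where `importlib.metadata` is part of the standard library; the `installed_version` helper is a hypothetical name, and the package names are the distribution names used by pip.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as used by pip for the packages listed above.
for name in ["numpy", "scipy", "scikit-learn", "matplotlib", "pandas"]:
    print(name, installed_version(name) or "not installed")
```

The printed versions can then be compared by eye against the minimum versions listed above.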
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Summary 


In this chapter, we explored machine learning on a very high level and familiarized ourselves with the 
big picture and major concepts that we are going to explore in the next chapters in more detail. 


We learned that supervised learning is composed of two important subfields: classification and 
regression. While classification models allow us to categorize objects into known classes, we can 
use regression analysis to predict the continuous outcomes of target variables. Unsupervised learning 
not only offers useful techniques for discovering structures in unlabeled data, but it can also be useful 
for data compression in feature preprocessing steps. 


We briefly went over the typical roadmap for applying machine learning to problem tasks, which we 
will use as a foundation for deeper discussions and hands-on examples in the following chapters. 
Finally, we set up our Python environment and installed and updated the required packages to get 
ready to see machine learning in action. 


In the following chapter, we will implement one of the earliest machine learning algorithms for 
classification that will prepare us for Chapter 3, A Tour of Machine Learning Classifiers Using 
Scikit-learn, where we cover more advanced machine learning algorithms using the scikit-learn open 
source machine learning library. Since machine learning algorithms learn from data, it is critical that 
we feed them useful information, and in Chapter 4, Building Good Training Sets—Data 
Preprocessing we will take a look at important data preprocessing techniques. In Chapter 5, 
Compressing Data via Dimensionality Reduction, we will learn about dimensionality reduction 
techniques that can help us to compress our dataset onto a lower-dimensional feature subspace, which 
can be beneficial for computational efficiency. An important aspect of building machine learning 
models is to evaluate their performance and to estimate how well they can make predictions on new, 
unseen data. In Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter 
Tuning we will learn all about the best practices for model tuning and evaluation. In certain 
scenarios, we still may not be satisfied with the performance of our predictive model although we 
may have spent hours or days extensively tuning and testing. In Chapter 7, Combining Different 
Models for Ensemble Learning we will learn how to combine different machine learning models to 
build even more powerful predictive systems. 


After we have covered all of the important concepts of a typical machine learning pipeline, we will 
implement a model for predicting emotions in text in Chapter 8, Applying Machine Learning to 
Sentiment Analysis, and in Chapter 9, Embedding a Machine Learning Model into a Web 
Application, we will embed it into a Web application to share it with the world. In Chapter 10, 
Predicting Continuous Target Variables with Regression Analysis we will then use machine learning 
algorithms for regression analysis that allow us to predict continuous output variables, and in Chapter 
11, Working with Unlabelled Data — Clustering Analysis we will apply clustering algorithms that 
will allow us to find hidden structures in data. The last chapter in this book will cover artificial 
neural networks that will allow us to tackle complex problems, such as image and speech recognition, 
which is currently one of the hottest topics in machine learning research. 
Chapter 2. Training Machine Learning 
Algorithms for Classification 


In this chapter, we will make use of two of the first algorithmically described machine learning 
algorithms for classification, the perceptron and adaptive linear neurons. We will start by 
implementing a perceptron step by step in Python and training it to classify different flower species in 
the Iris dataset. This will help us to understand the concept of machine learning algorithms for 
classification and how they can be efficiently implemented in Python. Discussing the basics of 
optimization using adaptive linear neurons will then lay the groundwork for using more powerful 
classifiers via the scikit-learn machine-learning library in Chapter 3, A Tour of Machine Learning 
Classifiers Using Scikit-learn. 


The topics that we will cover in this chapter are as follows: 


• Building an intuition for machine learning algorithms 
• Using pandas, NumPy, and matplotlib to read in, process, and visualize data 
• Implementing linear classification algorithms in Python 
Artificial neurons – a brief glimpse into the 
early history of machine learning 


Before we discuss the perceptron and related algorithms in more detail, let us take a brief tour 
through the early beginnings of machine learning. Trying to understand how the biological brain 
works to design artificial intelligence, Warren McCulloch and Walter Pitts published the first concept 
of a simplified brain cell, the so-called McCulloch-Pitts (MCP) neuron, in 1943 (W. S. McCulloch 
and W. Pitts. A Logical Calculus of the Ideas Immanent in Nervous Activity. The bulletin of 
mathematical biophysics, 5(4):115-133, 1943). Neurons are interconnected nerve cells in the brain 
that are involved in the processing and transmitting of chemical and electrical signals, which is 
illustrated in the following figure: 


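[Figure: schematic of a biological neuron. Input signals arrive at the dendrites, pass the cell body 
with the cell nucleus, and travel along the axon (wrapped in a myelin sheath) to the axon terminals, 
where they leave as output signals.] 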


McCulloch and Pitts described such a nerve cell as a simple logic gate with binary outputs; multiple 
signals arrive at the dendrites, are then integrated into the cell body, and, if the accumulated signal 
exceeds a certain threshold, an output signal is generated that will be passed on by the axon. 


Only a few years later, Frank Rosenblatt published the first concept of the perceptron learning rule 
based on the MCP neuron model (F. Rosenblatt, The Perceptron, a Perceiving and Recognizing 
Automaton. Cornell Aeronautical Laboratory, 1957). With his perceptron rule, Rosenblatt proposed 
an algorithm that would automatically learn the optimal weight coefficients that are then multiplied 
with the input features in order to make the decision of whether a neuron fires or not. In the context of 
supervised learning and classification, such an algorithm could then be used to predict if a sample 
belonged to one class or the other. 


More formally, we can pose this problem as a binary classification task where we refer to our two 
classes as 1 (positive class) and -1 (negative class) for simplicity. We can then define an activation 
function $\phi(z)$ that takes a linear combination of certain input values $x$ and a corresponding weight 
vector $w$, where $z$ is the so-called net input ($z = w_1 x_1 + \dots + w_m x_m$): 

$$w = \begin{bmatrix} w_1 \\ \vdots \\ w_m \end{bmatrix}, \quad x = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix}$$

Now, if the activation of a particular sample $x^{(i)}$, that is, the output of $\phi(z)$, is greater than a defined 
threshold $\theta$, we predict class 1, and class -1 otherwise. In the perceptron algorithm, the activation 
function $\phi(\cdot)$ is a simple unit step function, which is sometimes also called the Heaviside step 
function: 

$$\phi(z) = \begin{cases} 1 & \text{if } z \ge \theta \\ -1 & \text{otherwise} \end{cases}$$

For simplicity, we can bring the threshold $\theta$ to the left side of the equation and define a weight-zero 
as $w_0 = -\theta$ and $x_0 = 1$, so that we write $z$ in a more compact form: 

$$z = w_0 x_0 + w_1 x_1 + \dots + w_m x_m = w^T x$$

and 

$$\phi(z) = \begin{cases} 1 & \text{if } z \ge 0 \\ -1 & \text{otherwise} \end{cases}$$

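
The equivalence of the two forms of the decision rule can be checked numerically. The following is a minimal sketch; the weights, the sample, and the threshold are hypothetical values chosen only for illustration, and the function name unit_step is introduced here.

```python
import numpy as np


def unit_step(z):
    """Unit step function: 1 if z >= 0, otherwise -1."""
    return np.where(z >= 0.0, 1, -1)


# Hypothetical values, chosen only for illustration.
theta = 0.5
w = np.array([0.4, -0.2])
x = np.array([2.0, 1.0])

# Form 1: compare the weighted sum w_1*x_1 + ... + w_m*x_m with the threshold.
z_plain = np.dot(w, x)
pred_threshold = 1 if z_plain >= theta else -1

# Form 2: prepend w_0 = -theta and x_0 = 1, then compare with 0.
w_ext = np.concatenate(([-theta], w))
x_ext = np.concatenate(([1.0], x))
z = np.dot(w_ext, x_ext)
pred_compact = int(unit_step(z))

print(pred_threshold, pred_compact)
```

Both forms lead to the same prediction because subtracting the threshold from both sides of the inequality does not change which side is larger.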
Note 


In the following sections, we will often make use of basic notations from linear algebra. For example, 
we will abbreviate the sum of the products of the values in $x$ and $w$ using a vector dot product, 
whereas superscript T stands for transpose, which is an operation that transforms a column vector 
into a row vector and vice versa: 

$$z = w_0 x_0 + w_1 x_1 + \dots + w_m x_m = \sum_{j=0}^{m} x_j w_j = w^T x$$

For example: 

$$\begin{bmatrix} 1 & 2 & 3 \end{bmatrix} \times \begin{bmatrix} 4 \\ 5 \\ 6 \end{bmatrix} = 1 \times 4 + 2 \times 5 + 3 \times 6 = 32$$

Furthermore, the transpose operation can also be applied to a matrix to reflect it over its diagonal, for 
example: 

$$\begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix}^T = \begin{bmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \end{bmatrix}$$














In this book, we will only use the very basic concepts from linear algebra. However, if you need a 
quick refresher, please take a look at Zico Kolter's excellent Linear Algebra Review and Reference, 
which is freely available at http://www.cs.cmu.edu/~zkolter/course/linalg/linal 
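
The dot product and the transpose from this note map directly onto NumPy. The short sketch below uses the vector example from the note; the 3 x 2 matrix is simply a small illustrative array.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
dot = a.dot(b)            # 1*4 + 2*5 + 3*6

M = np.array([[1, 2],
              [3, 4],
              [5, 6]])
M_T = M.T                 # rows become columns: shape (3, 2) turns into (2, 3)

print(dot)
print(M_T)
```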





The following figure illustrates how the net input $z = w^T x$ is squashed into a binary output (-1 or 1) 
by the activation function of the perceptron (left subfigure) and how it can be used to discriminate 
between two linearly separable classes (right subfigure): 

[Figure: left, the unit step activation function mapping the net input to -1 or 1; right, a linear 
decision boundary separating two classes.] 

The whole idea behind the MCP neuron and Rosenblatt's thresholded perceptron model is to use a 
reductionist approach to mimic how a single neuron in the brain works: it either fires or it doesn't. 
Thus, Rosenblatt's initial perceptron rule is fairly simple and can be summarized by the following 
steps: 


1. Initialize the weights to 0 or small random numbers. 
2. For each training sample $x^{(i)}$ perform the following steps: 


1. Compute the output value $\hat{y}$. 
2. Update the weights. 


Here, the output value is the class label predicted by the unit step function that we defined earlier, and 
the simultaneous update of each weight $w_j$ in the weight vector $w$ can be more formally written as: 

$$w_j := w_j + \Delta w_j$$

The value of $\Delta w_j$, which is used to update the weight $w_j$, is calculated by the perceptron learning 
rule: 

$$\Delta w_j = \eta \left( y^{(i)} - \hat{y}^{(i)} \right) x_j^{(i)}$$

Where $\eta$ is the learning rate (a constant between 0.0 and 1.0), $y^{(i)}$ is the true class label of the $i$th 
training sample, and $\hat{y}^{(i)}$ is the predicted class label. It is important to note that all weights in the 
weight vector are being updated simultaneously, which means that we don't recompute the $\hat{y}^{(i)}$ before 
all of the weights $\Delta w_j$ were updated. Concretely, for a 2D dataset, we would write the update as 
follows: 

$$\Delta w_0 = \eta \left( y^{(i)} - \text{output}^{(i)} \right)$$

$$\Delta w_1 = \eta \left( y^{(i)} - \text{output}^{(i)} \right) x_1^{(i)}$$

$$\Delta w_2 = \eta \left( y^{(i)} - \text{output}^{(i)} \right) x_2^{(i)}$$
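
To make the update for a 2D dataset concrete, here is a minimal sketch of a single perceptron update for one training sample. The function name perceptron_update and the sample values are assumptions made for this illustration; the full class-based implementation follows later in this chapter.

```python
import numpy as np


def perceptron_update(w, xi, target, eta=0.1):
    """Apply the perceptron rule for one sample and return the new weights.

    w[0] is the zero-weight; w[1:] are the weights of the features.
    The prediction is computed once, before any weight is changed.
    """
    z = np.dot(xi, w[1:]) + w[0]
    predicted = 1 if z >= 0.0 else -1
    update = eta * (target - predicted)
    w_new = w.copy()
    w_new[1:] += update * xi      # delta w_1, delta w_2
    w_new[0] += update            # delta w_0 (x_0 = 1)
    return w_new


w = np.zeros(3)
xi = np.array([5.1, 1.4])        # hypothetical 2D sample

# With zero weights the net input is 0, so the prediction is class 1.
w_after_wrong = perceptron_update(w, xi, target=-1)   # misclassified sample
w_after_right = perceptron_update(w, xi, target=1)    # correctly classified sample
print(w_after_wrong, w_after_right)
```

Because the prediction is computed before any weight changes, all three weights are updated from the same output value, which is what the text means by a simultaneous update.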


Before we implement the perceptron rule in Python, let us make a simple thought experiment to 
illustrate how beautifully simple this learning rule really is. In the two scenarios where the 
perceptron predicts the class label correctly, the weights remain unchanged: 

$$\Delta w_j = \eta \left( -1 - (-1) \right) x_j^{(i)} = 0$$

$$\Delta w_j = \eta \left( 1 - 1 \right) x_j^{(i)} = 0$$

However, in the case of a wrong prediction, the weights are being pushed towards the direction of the 
positive or negative target class, respectively: 

$$\Delta w_j = \eta \left( 1 - (-1) \right) x_j^{(i)} = \eta (2) x_j^{(i)}$$

$$\Delta w_j = \eta \left( -1 - 1 \right) x_j^{(i)} = \eta (-2) x_j^{(i)}$$


To get a better intuition for the multiplicative factor $x_j^{(i)}$, let us go through another simple example, 
where: 

$$y^{(i)} = +1, \quad \hat{y}^{(i)} = -1, \quad \eta = 1$$

Let's assume that $x_j^{(i)} = 0.5$, and we misclassify this sample as -1. In this case, we would increase the 
corresponding weight by 1 so that the activation $x_j^{(i)} \times w_j$ will be more positive the next time we 
encounter this sample and thus will be more likely to be above the threshold of the unit step function 
to classify the sample as +1: 

$$\Delta w_j = \left( 1 - (-1) \right) 0.5 = (2)\, 0.5 = 1$$

The weight update is proportional to the value of $x_j^{(i)}$. For example, if we have another sample 
$x_j^{(i)} = 2$ that is incorrectly classified as -1, we'd push the decision boundary by an even larger extent 
to classify this sample correctly the next time: 

$$\Delta w_j = \left( 1 - (-1) \right) 2 = (2)\, 2 = 4$$
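
The two worked examples above can be reproduced with a one-line helper. The function name delta_w is introduced here only for illustration.

```python
def delta_w(eta, y_true, y_pred, x_j):
    """Perceptron weight change for a single feature value x_j."""
    return eta * (y_true - y_pred) * x_j


small = delta_w(eta=1.0, y_true=1, y_pred=-1, x_j=0.5)
large = delta_w(eta=1.0, y_true=1, y_pred=-1, x_j=2.0)
unchanged = delta_w(eta=1.0, y_true=1, y_pred=1, x_j=2.0)
print(small, large, unchanged)
```

The third call shows the complementary case: when the prediction is correct, the weight change is zero regardless of the feature value.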


It is important to note that the convergence of the perceptron is only guaranteed if the two classes are 
linearly separable and the learning rate is sufficiently small. If the two classes can't be separated by a 
linear decision boundary, we can set a maximum number of passes over the training dataset (epochs) 
and/or a threshold for the number of tolerated misclassifications; otherwise, the perceptron would 
never stop updating the weights: 


[Figure: three scatter plots of two classes, titled "Linearly separable", "Not linearly separable", and 
"Not linearly separable".] 
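
The role of the epoch limit can be illustrated with two tiny toy datasets. The sketch below is an assumption-laden illustration rather than the implementation used in the rest of this chapter: the function train_perceptron is a compact stand-in, and the AND and XOR label vectors are toy data chosen because AND is linearly separable and XOR is not.

```python
import numpy as np


def train_perceptron(X, y, eta=0.5, n_iter=10):
    """Train with the perceptron rule; return the number of updates per epoch."""
    w = np.zeros(1 + X.shape[1])
    errors_per_epoch = []
    for _ in range(n_iter):          # hard limit on the number of epochs
        errors = 0
        for xi, target in zip(X, y):
            predicted = 1 if np.dot(xi, w[1:]) + w[0] >= 0.0 else -1
            update = eta * (target - predicted)
            w[1:] += update * xi
            w[0] += update
            errors += int(update != 0.0)
        errors_per_epoch.append(errors)
    return errors_per_epoch


X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_and = np.array([-1, -1, -1, 1])    # linearly separable
y_xor = np.array([-1, 1, 1, -1])     # not linearly separable

errors_and = train_perceptron(X, y_and, n_iter=20)
errors_xor = train_perceptron(X, y_xor, n_iter=20)
print("AND:", errors_and)
print("XOR:", errors_xor)
```

For the separable labels the error count drops to zero and stays there, whereas for XOR every epoch contains at least one update, so only the n_iter limit ends the training.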
Tip 
Downloading the example code 


You can download the example code files from your account at http://www.packtpub.com for all the 
Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit 
http://www.packtpub.com/support and register to have the files e-mailed directly to you. 





Now, before we jump into the implementation in the next section, let us summarize what we just 
learned in a simple figure that illustrates the general concept of the perceptron: 

[Figure: schematic of the perceptron: inputs and weights, net input function, activation function, 
output, and the error that is fed back for the weight update.] 

The preceding figure illustrates how the perceptron receives the inputs of a sample $x$ and combines 
them with the weights to compute the net input. The net input is then passed on to the activation 
function (here: the unit step function), which generates a binary output -1 or +1—the predicted class 
label of the sample. During the learning phase, this output is used to calculate the error of the 
prediction and update the weights. 
Implementing a perceptron learning algorithm 
in Python 


In the previous section, we learned how Rosenblatt's perceptron rule works; let us now go ahead and 
implement it in Python and apply it to the Iris dataset that we introduced in Chapter 1, Giving 
Computers the Ability to Learn from Data. We will take an object-oriented approach to define the 
perceptron interface as a Python class, which allows us to initialize new perceptron objects that can 
learn from data via a fit method, and make predictions via a separate predict method. As a 
convention, we add an underscore to attributes that are not being created upon the initialization of the 
object but by calling the object's other methods, for example, self.w_. 


Note 


If you are not yet familiar with Python's scientific libraries or need a refresher, please see the 
following resources: 


NumPy: http://wiki.scipy.org/Tentative_NumPy_Tutorial 


Pandas: http://pandas.pydata.org/pandas-docs/stable/tutorials.html 


Matplotlib: http://matplotlib.org/users/beginner.html 


Also, to better follow the code examples, I recommend you download the IPython notebooks from the 
Packt website. For a general introduction to IPython notebooks, please visit 
https://ipython.org/ipython-doc/3/notebook/index.html. 


import numpy as np 
class Perceptron(object): 
    """Perceptron classifier. 

    Parameters 
    ------------ 
    eta : float 
        Learning rate (between 0.0 and 1.0) 
    n_iter : int 
        Passes over the training dataset. 

    Attributes 
    ----------- 
    w_ : 1d-array 
        Weights after fitting. 
    errors_ : list 
        Number of misclassifications in every epoch. 

    """ 
    def __init__(self, eta=0.01, n_iter=10): 
        self.eta = eta 
        self.n_iter = n_iter 

    def fit(self, X, y): 
        """Fit training data. 

        Parameters 
        ---------- 
        X : {array-like}, shape = [n_samples, n_features] 
            Training vectors, where n_samples 
            is the number of samples and 
            n_features is the number of features. 
        y : array-like, shape = [n_samples] 
            Target values. 

        Returns 
        ------- 
        self : object 

        """ 
        self.w_ = np.zeros(1 + X.shape[1]) 
        self.errors_ = [] 

        for _ in range(self.n_iter): 
            errors = 0 
            for xi, target in zip(X, y): 
                update = self.eta * (target - self.predict(xi)) 
                self.w_[1:] += update * xi 
                self.w_[0] += update 
                errors += int(update != 0.0) 
            self.errors_.append(errors) 
        return self 

    def net_input(self, X): 
        """Calculate net input""" 
        return np.dot(X, self.w_[1:]) + self.w_[0] 

    def predict(self, X): 
        """Return class label after unit step""" 
        return np.where(self.net_input(X) >= 0.0, 1, -1) 


Using this perceptron implementation, we can now initialize new Perceptron objects with a given 
learning rate eta and n_iter, which is the number of epochs (passes over the training set). Via the 
fit method we initialize the weights in self.w_ to a zero-vector $\mathbb{R}^{m+1}$ where $m$ stands for the 
number of dimensions (features) in the dataset where we add 1 for the zero-weight (that is, the 
threshold). 


Note 


NumPy indexing for one-dimensional arrays works similarly to Python lists using the square-bracket 
([]) notation. For two-dimensional arrays, the first indexer refers to the row number, and the second 
indexer to the column number. For example, we would use X[2, 3] to select the third row and fourth 
column of a 2D array X. 
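
The indexing rules from this note can be tried out on small arrays. The array contents below are arbitrary illustrative values.

```python
import numpy as np

v = np.array([10, 20, 30, 40])
first = v[0]                        # same square-bracket indexing as a Python list

A = np.arange(20).reshape(4, 5)     # 4 rows, 5 columns, filled with 0..19
element = A[2, 3]                   # third row, fourth column
row = A[2]                          # the entire third row
column = A[:, 3]                    # the entire fourth column

print(first, element)
print(row, column)
```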


After the weights have been initialized, the fit method loops over all individual samples in the 
training set and updates the weights according to the perceptron learning rule that we discussed in the 
previous section. The class labels are predicted by the predict method, which is also called in the 
fit method to predict the class label for the weight update, but predict can also be used to predict 
the class labels of new data after we have fitted our model. Furthermore, we also collect the number 
of misclassifications during each epoch in the list self.errors_ so that we can later analyze how 
well our perceptron performed during the training. The np.dot function that is used in the net_input 
method simply calculates the vector dot product $w^T x$. 


Note 


Instead of using NumPy to calculate the vector dot product between two arrays a and b via a.dot(b) 
or np.dot(a, b), we could also perform the calculation in pure Python via sum([i*j for i, j in 
zip(a, b)]). However, the advantage of using NumPy over classic Python for-loop structures is that 
its arithmetic operations are vectorized. Vectorization means that an elemental arithmetic operation 
is automatically applied to all elements in an array. By formulating our arithmetic operations as a 
sequence of instructions on an array rather than performing a set of operations for each element one at 
a time, we can make better use of our modern CPU architectures with Single Instruction, Multiple 
Data (SIMD) support. Furthermore, NumPy uses highly optimized linear algebra libraries, such as 
Basic Linear Algebra Subprograms (BLAS) and Linear Algebra Package (LAPACK) that have 
been written in C or Fortran. Lastly, NumPy also allows us to write our code in a more compact and 
intuitive way using the basics of linear algebra, such as vector and matrix dot products. 
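
The difference between the two approaches can be explored with the standard timeit module. The following is only a sketch: the array length of 1000 and the helper name dot_pure_python are arbitrary choices made here, and the measured times depend on the machine, so no specific numbers are claimed.

```python
import timeit

import numpy as np

a = np.arange(1000, dtype=float)
b = np.arange(1000, dtype=float)


def dot_pure_python(u, v):
    """Dot product with an explicit Python-level loop."""
    return sum(i * j for i, j in zip(u, v))


result_loop = dot_pure_python(a, b)
result_numpy = np.dot(a, b)

# Time 100 repetitions of each variant.
t_loop = timeit.timeit(lambda: dot_pure_python(a, b), number=100)
t_numpy = timeit.timeit(lambda: np.dot(a, b), number=100)
print("pure Python: %.4f s, NumPy: %.4f s" % (t_loop, t_numpy))
```

Both variants return the same value; only the time needed to compute it differs.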




Training a perceptron model on the Iris dataset 


To test our perceptron implementation, we will load the two flower classes Setosa and Versicolor 
from the Iris dataset. Although the perceptron rule is not restricted to two dimensions, we will only 
consider the two features sepal length and petal length for visualization purposes. Also, we only 
chose the two flower classes Setosa and Versicolor for practical reasons. However, the perceptron 
algorithm can be extended to multi-class classification, for example, through the One-vs.-All 
technique. 


Note 


One-vs.-All (OvA), sometimes also called One-vs.-Rest (OvR), is a technique used to extend a 
binary classifier to multi-class problems. Using OvA, we can train one classifier per class, where the 
particular class is treated as the positive class and the samples from all other classes are considered 
as the negative class. If we were to classify a new data sample, we would use our $n$ classifiers, 
where $n$ is the number of class labels, and assign the class label with the highest confidence to the 
particular sample. In the case of the perceptron, we would use OvA to choose the class label that is 
associated with the largest absolute net input value. 
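
The prediction step of OvA can be sketched as follows. Everything in this block is an assumption made for illustration: the weight matrix W stands in for three binary classifiers that have already been trained, its values are made up, and the sketch uses the signed net input as the confidence score.

```python
import numpy as np

# Hypothetical weights of three already-trained binary classifiers.
# One row per class; column 0 holds the zero-weight.
W = np.array([[ 1.0, -1.0, 0.0],    # class 0 vs. rest
              [-1.0,  0.5, 0.5],    # class 1 vs. rest
              [-3.0,  0.0, 1.0]])   # class 2 vs. rest


def ova_predict(X, W):
    """Pick, for each sample, the class whose classifier has the largest net input."""
    net_inputs = X.dot(W[:, 1:].T) + W[:, 0]   # shape: (n_samples, n_classes)
    return np.argmax(net_inputs, axis=1)


X_new = np.array([[0.0, 0.0],
                  [3.0, 1.0],
                  [0.0, 5.0]])
labels = ova_predict(X_new, W)
print(labels)
```

Each column of net_inputs holds the output of one binary classifier, so the argmax over the columns selects the classifier that is most confident about the positive class.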


First, we will use the pandas library to load the Iris dataset directly from the UCI Machine Learning 
Repository into a DataFrame object and print the last five lines via the tail method to check that the 
data was loaded correctly: 


>>> import pandas as pd 
>>> df = pd.read_csv('https://archive.ics.uci.edu/ml/' 
...     'machine-learning-databases/iris/iris.data', header=None) 
>>> df.tail() 

Next, we extract the first 100 class labels that correspond to the 50 Iris-Setosa and 50 
Iris-Versicolor flowers, respectively, and convert the class labels into the two integer class labels 1 
(Versicolor) and -1 (Setosa) that we assign to a vector y where the values method of a pandas 
DataFrame yields the corresponding NumPy representation. Similarly, we extract the first feature 
column (sepal length) and the third feature column (petal length) of those 100 training samples and 
assign them to a feature matrix X, which we can visualize via a two-dimensional scatter plot: 


>>> import matplotlib.pyplot as plt 
>>> import numpy as np 

>>> y = df.iloc[0:100, 4].values 
>>> y = np.where(y == 'Iris-setosa', -1, 1) 
>>> X = df.iloc[0:100, [0, 2]].values 

>>> # X[:, 0] holds the sepal length and X[:, 1] the petal length 
>>> plt.scatter(X[:50, 0], X[:50, 1], 
...             color='red', marker='o', label='setosa') 
>>> plt.scatter(X[50:100, 0], X[50:100, 1], 
...             color='blue', marker='x', label='versicolor') 
>>> plt.xlabel('sepal length') 
>>> plt.ylabel('petal length') 
>>> plt.legend(loc='upper left') 
>>> plt.show() 


After executing the preceding code example we should now see the following scatterplot: 


[Figure: scatter plot of the setosa and versicolor samples; axes: sepal length [cm] and petal length [cm].] 
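
The label conversion and the column selection can also be tried without downloading the dataset. The sketch below uses four illustrative rows in the layout of iris.data; the names rows, X_demo, and y_demo are introduced here so that they do not clash with the X and y used in this chapter.

```python
import numpy as np

# A few illustrative rows in the layout of iris.data:
# sepal length, sepal width, petal length, petal width, class label.
rows = [
    (5.1, 3.5, 1.4, 0.2, 'Iris-setosa'),
    (4.9, 3.0, 1.4, 0.2, 'Iris-setosa'),
    (7.0, 3.2, 4.7, 1.4, 'Iris-versicolor'),
    (6.4, 3.2, 4.5, 1.5, 'Iris-versicolor'),
]

labels = np.array([r[4] for r in rows])
features = np.array([r[:4] for r in rows], dtype=float)

# Map the string labels to -1 (Setosa) and 1 (Versicolor).
y_demo = np.where(labels == 'Iris-setosa', -1, 1)

# Keep the first and third feature columns: sepal length and petal length.
X_demo = features[:, [0, 2]]

print(y_demo)
print(X_demo)
```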


Now it's time to train our perceptron algorithm on the Iris data subset that we just extracted. Also, we 
will plot the misclassification error for each epoch to check if the algorithm converged and 
found a decision boundary that separates the two Iris flower classes: 

>>> ppn = Perceptron(eta=0.1, n_iter=10) 
>>> ppn.fit(X, y) 
>>> plt.plot(range(1, len(ppn.errors_) + 1), ppn.errors_, 
...          marker='o') 
>>> plt.xlabel('Epochs') 
>>> plt.ylabel('Number of misclassifications') 
>>> plt.show() 


After executing the preceding code, we should see the plot of the misclassification errors versus the 
number of epochs, as shown next: 


[Figure: number of misclassifications (y-axis) plotted against the number of epochs (x-axis).] 





As we can see in the preceding plot, our perceptron already converged after the sixth epoch and 
should now be able to classify the training samples perfectly. Let us implement a small convenience 
function to visualize the decision boundaries for 2D datasets: 


from matplotlib.colors import ListedColormap 

def plot_decision_regions(X, y, classifier, resolution=0.02): 

    # setup marker generator and color map 
    markers = ('s', 'x', 'o', '^', 'v') 
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan') 
    cmap = ListedColormap(colors[:len(np.unique(y))]) 

    # plot the decision surface 
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1 
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1 
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution), 
                           np.arange(x2_min, x2_max, resolution)) 
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T) 
    Z = Z.reshape(xx1.shape) 
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap) 
    plt.xlim(xx1.min(), xx1.max()) 
    plt.ylim(xx2.min(), xx2.max()) 

    # plot class samples 
    for idx, cl in enumerate(np.unique(y)): 
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1], 
                    alpha=0.8, c=cmap(idx), 
                    marker=markers[idx], label=cl) 


First, we define a number of colors and markers and create a color map from the list of colors via 
ListedColormap. Then, we determine the minimum and maximum values for the two features and use 
those feature vectors to create a pair of grid arrays xx1 and xx2 via the NumPy meshgrid function. 
Since we trained our perceptron classifier on two feature dimensions, we need to flatten the grid 
arrays and create a matrix that has the same number of columns as the Iris training subset so that we 
can use the predict method to predict the class labels Z of the corresponding grid points. After 
reshaping the predicted class labels Z into a grid with the same dimensions as xx1 and xx2, we can 
now draw a contour plot via matplotlib's contourf function that maps the different decision regions 
to different colors for each predicted class in the grid array: 


>>> plot_decision_regions(X, y, classifier=ppn) 
>>> plt.xlabel('sepal length [cm]') 
>>> plt.ylabel('petal length [cm]') 
>>> plt.legend(loc='upper left') 
>>> plt.show() 


After executing the preceding code example, we should now see a plot of the decision regions, as 
shown in the following figure: 

[Figure: decision regions of the perceptron for the two Iris classes.] 

As we can see in the preceding plot, the perceptron learned a decision boundary that was able to 
classify all flower samples in the Iris training subset perfectly. 
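
The grid construction used in plot_decision_regions is easier to follow on a tiny grid. In the sketch below, the rule "label 1 where the two coordinates sum to at least 2" is a made-up stand-in for classifier.predict, chosen only so that the result can be checked by hand.

```python
import numpy as np

# A 3 x 2 grid: x1 takes the values 0, 1, 2 and x2 takes the values 0, 1.
xx1, xx2 = np.meshgrid(np.arange(0, 3), np.arange(0, 2))

# Flatten both grids and stack them as columns: one row per grid point.
grid_points = np.array([xx1.ravel(), xx2.ravel()]).T

# Stand-in for classifier.predict (an assumption of this sketch).
Z = np.where(grid_points.sum(axis=1) >= 2, 1, -1)

# Bring the flat predictions back into the shape of the grid.
Z = Z.reshape(xx1.shape)

print(grid_points)
print(Z)
```

After the reshape, Z[i, j] is the predicted label of the grid point (xx1[i, j], xx2[i, j]), which is the layout that contourf expects.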


Note 


Although the perceptron classified the two Iris flower classes perfectly, convergence is one of the 
biggest problems of the perceptron. Frank Rosenblatt proved mathematically that the perceptron 
learning rule converges if the two classes can be separated by a linear hyperplane. However, if 
classes cannot be separated perfectly by such a linear decision boundary, the weights will never stop 
updating unless we set a maximum number of epochs. 
Adaptive linear neurons and the convergence of 
learning 


In this section, we will take a look at another type of single-layer neural network: ADAptive LInear 
NEuron (Adaline). Adaline was published, only a few years after Frank Rosenblatt's perceptron 
algorithm, by Bernard Widrow and his doctoral student Ted Hoff, and can be considered as an 
improvement on the latter (B. Widrow et al. Adaptive "Adaline" neuron using chemical 
"memistors". Number Technical Report 1553-2. Stanford Electron. Labs. Stanford, CA, October 
1960). The Adaline algorithm is particularly interesting because it illustrates the key concept of 
defining and minimizing cost functions, which will lay the groundwork for understanding more 
advanced machine learning algorithms for classification, such as logistic regression and support 
vector machines, as well as regression models that we will discuss in future chapters. 


The key difference between the Adaline rule (also known as the Widrow-Hoff rule) and Rosenblatt's 
perceptron is that the weights are updated based on a linear activation function rather than a unit step 
function like in the perceptron. In Adaline, this linear activation function $\phi(z)$ is simply the identity 
function of the net input so that $\phi(w^T x) = w^T x$. 

While the linear activation function is used for learning the weights, a quantizer, which is similar to 
the unit step function that we have seen before, can then be used to predict the class labels, as 
illustrated in the following figure: 





[Figure: schematic of Adaline: net input function, linear activation function, and quantizer, which 
produces the output.] 


If we compare the preceding figure to the illustration of the perceptron algorithm that we saw earlier, 
the difference is that we now use the continuous valued output from the linear activation function 
to compute the model error and update the weights, rather than the binary class labels. 
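
The difference between the two error signals can be seen on a small example. The weights and samples below are hypothetical values chosen for illustration, and the names X_demo and y_demo are introduced only for this sketch.

```python
import numpy as np

w = np.array([-0.5, 1.0, 0.25])        # hypothetical weights; w[0] is the zero-weight
X_demo = np.array([[0.2, 0.4],
                   [1.0, 2.0]])
y_demo = np.array([-1, 1])

net = X_demo.dot(w[1:]) + w[0]
activation = net                                   # linear activation: the identity
predicted = np.where(activation >= 0.0, 1, -1)     # quantizer

adaline_errors = y_demo - activation       # continuous values
perceptron_errors = y_demo - predicted     # integer class-label differences

print(adaline_errors)
print(perceptron_errors)
```

Both samples are classified correctly, so the perceptron error is zero for both, while the Adaline error for the first sample is still non-zero because its activation is not exactly -1.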
Minimizing cost functions with gradient descent 


One of the key ingredients of supervised machine learning algorithms is to define an objective 
function that is to be optimized during the learning process. This objective function is often a cost 
function that we want to minimize. In the case of Adaline, we can define the cost function $J$ to learn 
the weights as the Sum of Squared Errors (SSE) between the calculated outcome and the true class 
label: 

$$J(w) = \frac{1}{2} \sum_i \left( y^{(i)} - \phi(z^{(i)}) \right)^2$$

The term $\frac{1}{2}$ is just added for our convenience; it will make it easier to derive the gradient, as we will 
see in the following paragraphs. The main advantage of this continuous linear activation function is 
that, in contrast to the unit step function, the cost function becomes differentiable. Another nice 
property of this cost function is that it is convex; thus, we can use a simple, yet powerful, optimization 
algorithm called gradient descent to find the weights that minimize our cost function to classify the 
samples in the Iris dataset. 


As illustrated in the following figure, we can describe the principle behind gradient descent as 
climbing down a hill until a local or global cost minimum is reached. In each iteration, we take a step 
in the opposite direction of the gradient, where the step size is determined by the value of the learning 
rate as well as the slope of the gradient: 


[Figure: cost J(w) plotted against the weight w, showing the initial weight, the gradient, and the 
global cost minimum J_min(w).] 
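
The hill-climbing picture can be reproduced in one dimension. The following sketch is not the Adaline cost function: it uses the toy convex cost J(w) = (w - 3)^2, and the function name gradient_descent_1d, the starting point, and the learning rate are assumptions chosen for illustration.

```python
def gradient_descent_1d(grad, w0, eta, n_iter):
    """Repeatedly step against the gradient; return all visited weights."""
    path = [w0]
    w = w0
    for _ in range(n_iter):
        w = w - eta * grad(w)     # step size = learning rate * slope
        path.append(w)
    return path


def cost(w):
    """Toy convex cost with its minimum at w = 3."""
    return (w - 3.0) ** 2


def grad(w):
    """Derivative of the toy cost."""
    return 2.0 * (w - 3.0)


path = gradient_descent_1d(grad, w0=0.0, eta=0.1, n_iter=50)
print("final weight:", path[-1], "final cost:", cost(path[-1]))
```

Far from the minimum the slope is large and so are the steps; close to the minimum the slope, and with it the step size, shrinks towards zero.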








Using gradient descent, we can now update the weights by taking a step in the opposite direction of 
the gradient $\nabla J(w)$ of our cost function $J(w)$: 

$$w := w + \Delta w$$

Here, the weight change $\Delta w$ is defined as the negative gradient multiplied by the learning rate $\eta$: 

$$\Delta w = -\eta \nabla J(w)$$

To compute the gradient of the cost function, we need to compute the partial derivative of the cost 
function with respect to each weight $w_j$: 

$$\frac{\partial J}{\partial w_j} = -\sum_i \left( y^{(i)} - \phi(z^{(i)}) \right) x_j^{(i)}$$

so that we can write the update of weight $w_j$ as: 

$$\Delta w_j = -\eta \frac{\partial J}{\partial w_j} = \eta \sum_i \left( y^{(i)} - \phi(z^{(i)}) \right) x_j^{(i)}$$

Since we update all weights simultaneously, our Adaline learning rule becomes $w := w + \Delta w$. 


Note 


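For those who are familiar with calculus, the partial derivative of the SSE cost function with respect 
to the jth weight can be obtained as follows: 

$$\frac{\partial J}{\partial w_j} = \frac{\partial}{\partial w_j} \frac{1}{2} \sum_i \left( y^{(i)} - \phi(z^{(i)}) \right)^2$$

$$= \frac{1}{2} \frac{\partial}{\partial w_j} \sum_i \left( y^{(i)} - \phi(z^{(i)}) \right)^2$$

$$= \frac{1}{2} \sum_i 2 \left( y^{(i)} - \phi(z^{(i)}) \right) \frac{\partial}{\partial w_j} \left( y^{(i)} - \phi(z^{(i)}) \right)$$

$$= \sum_i \left( y^{(i)} - \phi(z^{(i)}) \right) \frac{\partial}{\partial w_j} \left( y^{(i)} - \sum_k w_k x_k^{(i)} \right)$$

$$= \sum_i \left( y^{(i)} - \phi(z^{(i)}) \right) \left( -x_j^{(i)} \right)$$

$$= -\sum_i \left( y^{(i)} - \phi(z^{(i)}) \right) x_j^{(i)}$$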
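
The result of this derivation can be cross-checked numerically with finite differences. The sketch below is an illustration under stated assumptions: the function names sse_cost, sse_gradient, and numerical_gradient are introduced here, and the data are random numbers generated only for the check.

```python
import numpy as np


def sse_cost(w, X, y):
    """J(w) = 1/2 * sum_i (y_i - output_i)^2 with a linear activation."""
    output = X.dot(w[1:]) + w[0]
    return 0.5 * ((y - output) ** 2).sum()


def sse_gradient(w, X, y):
    """Analytic gradient: dJ/dw_j = -sum_i (y_i - output_i) * x_ij, with x_i0 = 1."""
    errors = y - (X.dot(w[1:]) + w[0])
    return -np.concatenate(([errors.sum()], X.T.dot(errors)))


def numerical_gradient(w, X, y, eps=1e-6):
    """Central finite differences, used only as a cross-check."""
    grad = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (sse_cost(w + step, X, y) - sse_cost(w - step, X, y)) / (2 * eps)
    return grad


rng = np.random.RandomState(0)
X_demo = rng.normal(size=(5, 2))
y_demo = np.array([1.0, -1.0, 1.0, 1.0, -1.0])
w_demo = rng.normal(size=3)

analytic = sse_gradient(w_demo, X_demo, y_demo)
numeric = numerical_gradient(w_demo, X_demo, y_demo)
print(analytic)
print(numeric)
```

The two gradients agree up to rounding error, which supports the sign and the factor obtained in the derivation.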
Although the Adaline learning rule looks identical to the perceptron rule, the $\phi(z^{(i)})$ with 
$z^{(i)} = w^T x^{(i)}$ is a real number and not an integer class label. Furthermore, the weight update is calculated 
based on all samples in the training set (instead of updating the weights incrementally after each 
sample), which is why this approach is also referred to as "batch" gradient descent. 
Implementing an Adaptive Linear Neuron in Python 


Since the perceptron rule and Adaline are very similar, we will take the perceptron implementation 
that we defined earlier and change the fit method so that the weights are updated by minimizing the 
cost function via gradient descent: 


class AdalineGD(object): 
    """ADAptive LInear NEuron classifier. 

    Parameters 
    ------------ 
    eta : float 
        Learning rate (between 0.0 and 1.0) 
    n_iter : int 
        Passes over the training dataset. 

    Attributes 
    ----------- 
    w_ : 1d-array 
        Weights after fitting. 
    cost_ : list 
        Sum-of-squares cost function value in every epoch. 

    """ 
    def __init__(self, eta=0.01, n_iter=50): 
        self.eta = eta 
        self.n_iter = n_iter 

    def fit(self, X, y): 
        """Fit training data. 

        Parameters 
        ---------- 
        X : {array-like}, shape = [n_samples, n_features] 
            Training vectors, 
            where n_samples is the number of samples and 
            n_features is the number of features. 
        y : array-like, shape = [n_samples] 
            Target values. 

        Returns 
        ------- 
        self : object 

        """ 
        self.w_ = np.zeros(1 + X.shape[1]) 
        self.cost_ = [] 

        for i in range(self.n_iter): 
            output = self.net_input(X) 
            errors = (y - output) 
            self.w_[1:] += self.eta * X.T.dot(errors) 
            self.w_[0] += self.eta * errors.sum() 
            cost = (errors**2).sum() / 2.0 
            self.cost_.append(cost) 
        return self 

    def net_input(self, X): 
        """Calculate net input""" 
        return np.dot(X, self.w_[1:]) + self.w_[0] 

    def activation(self, X): 
        """Compute linear activation""" 
        return self.net_input(X) 

    def predict(self, X): 
        """Return class label after unit step""" 
        return np.where(self.activation(X) >= 0.0, 1, -1) 


Instead of updating the weights after evaluating each individual training sample, as in the perceptron, 
we calculate the gradient based on the whole training dataset via self.eta * errors.sum() for the 
zero-weight and via self.eta * X.T.dot(errors) for the weights 1 to $m$, where 
X.T.dot(errors) is a matrix-vector multiplication between our feature matrix and the error vector. 
Similar to the previous perceptron implementation, we collect the cost values in a list self.cost_ to 
check if the algorithm converged after training. 


Note 


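Performing a matrix-vector multiplication is similar to calculating a vector dot product where each
row in the matrix is treated as a single row vector. This vectorized approach represents a more
compact notation and results in a more efficient computation using NumPy. For example:


$$\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \times \begin{bmatrix} 7 \\ 8 \\ 9 \end{bmatrix} = \begin{bmatrix} 1 \times 7 + 2 \times 8 + 3 \times 9 \\ 4 \times 7 + 5 \times 8 + 6 \times 9 \end{bmatrix} = \begin{bmatrix} 50 \\ 122 \end{bmatrix}$$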

In practice, it often requires some experimentation to find a good learning rate $\eta$ for optimal
convergence. So, let's choose two different learning rates, $\eta = 0.01$ and $\eta = 0.0001$, to start with and
plot the cost functions versus the number of epochs to see how well the Adaline implementation
learns from the training data.


Note 


The learning rate $\eta$, as well as the number of epochs n_iter, are the so-called hyperparameters of
the perceptron and Adaline learning algorithms. In Chapter 4, Building Good Training Sets – Data
Preprocessing, we will take a look at different techniques to automatically find the values of different
hyperparameters that yield optimal performance of the classification model.


Let us now plot the cost against the number of epochs for the two different learning rates: 


>>> fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(8, 4))
>>> ada1 = AdalineGD(n_iter=10, eta=0.01).fit(X, y)
>>> ax[0].plot(range(1, len(ada1.cost_) + 1),
...            np.log10(ada1.cost_), marker='o')
>>> ax[0].set_xlabel('Epochs')
>>> ax[0].set_ylabel('log(Sum-squared-error)')
>>> ax[0].set_title('Adaline - Learning rate 0.01')
>>> ada2 = AdalineGD(n_iter=10, eta=0.0001).fit(X, y)
>>> ax[1].plot(range(1, len(ada2.cost_) + 1),
...            ada2.cost_, marker='o')
>>> ax[1].set_xlabel('Epochs')
>>> ax[1].set_ylabel('Sum-squared-error')
>>> ax[1].set_title('Adaline - Learning rate 0.0001')
>>> plt.show()


As we can see in the resulting cost function plots next, we encountered two different types of
problems. The left chart shows what could happen if we choose a learning rate that is too large:
instead of minimizing the cost function, the error becomes larger in every epoch because we
overshoot the global minimum:


[Figure: two panels. Left: "Adaline - Learning rate 0.01", log(Sum-squared-error) versus Epochs. Right: "Adaline - Learning rate 0.0001", Sum-squared-error versus Epochs.]


Although we can see that the cost decreases when we look at the right plot, the chosen learning rate
$\eta = 0.0001$ is so small that the algorithm would require a very large number of epochs to converge.
The following figure illustrates how we change the value of a particular weight parameter to
minimize the cost function $J$ (left subfigure). The subfigure on the right illustrates what happens if we
choose a learning rate that is too large, so that we overshoot the global minimum:


[Figure: the cost function $J(w)$ plotted against a weight $w$, with the initial weight and the global cost minimum marked.]





Many machine learning algorithms that we will encounter throughout this book require some sort of 
feature scaling for optimal performance, which we will discuss in more detail in Chapter 3, A Tour of 
Machine Learning Classifiers Using Scikit-learn. Gradient descent is one of the many algorithms 
that benefit from feature scaling. Here, we will use a feature scaling method called standardization, 
which gives our data the zero mean and unit variance of a standard normal distribution. The mean of
each feature is centered at value 0 and the feature column has a standard deviation of 1. For example,
to standardize the $j$th feature, we simply need to subtract the sample mean $\mu_j$ from every training
sample and divide it by its standard deviation $\sigma_j$:


$$x'_j = \frac{x_j - \mu_j}{\sigma_j}$$


Here, $x_j$ is a vector consisting of the $j$th feature values of all $n$ training samples.


Standardization can easily be achieved using the NumPy methods mean and std: 


>>> X_std = np.copy(X)
>>> X_std[:,0] = (X[:,0] - X[:,0].mean()) / X[:,0].std()
>>> X_std[:,1] = (X[:,1] - X[:,1].mean()) / X[:,1].std()
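As a quick sanity check, the following sketch standardizes all columns of a small, made-up feature matrix at once and confirms that each column ends up with mean 0 and standard deviation 1. The values in X are hypothetical, and NumPy's std is used with its default setting (the population standard deviation).

```python
import numpy as np

# Hypothetical toy feature matrix: two features on very different scales.
X = np.array([[5.1, 140.0],
              [4.9, 130.0],
              [7.0, 470.0],
              [6.4, 450.0]])

# Standardize every column in one step; axis=0 gives per-column statistics.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

col_means = X_std.mean(axis=0)  # approximately 0 for each column
col_stds = X_std.std(axis=0)    # approximately 1 for each column

print(np.allclose(col_means, 0.0), np.allclose(col_stds, 1.0))
```

Using axis=0 avoids writing one line per feature, which becomes convenient once there are more than two columns.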


After standardization, we will train the Adaline again and see that it now converges using a learning
rate $\eta = 0.01$:


>>> ada = AdalineGD(n_iter=15, eta=0.01)
>>> ada.fit(X_std, y)
>>> plot_decision_regions(X_std, y, classifier=ada)
>>> plt.title('Adaline - Gradient Descent')
>>> plt.xlabel('sepal length [standardized]')
>>> plt.ylabel('petal length [standardized]')
>>> plt.legend(loc='upper left')
>>> plt.show()
>>> plt.plot(range(1, len(ada.cost_) + 1), ada.cost_, marker='o')
>>> plt.xlabel('Epochs')
>>> plt.ylabel('Sum-squared-error')
>>> plt.show()


After executing the preceding code, we should see a figure of the decision regions as well as a plot of
the declining cost, as shown in the following figure:


As we can see in the preceding plots, the Adaline now converges after training on the standardized
features using a learning rate $\eta = 0.01$. However, note that the SSE remains non-zero even though all
samples were classified correctly.

Large scale machine learning and stochastic gradient descent


In the previous section, we learned how to minimize a cost function by taking a step into the opposite
direction of a gradient that is calculated from the whole training set; this is why this approach is
sometimes also referred to as batch gradient descent. Now imagine we have a very large dataset with
millions of data points, which is not uncommon in many machine learning applications. Running batch
gradient descent can be computationally quite costly in such scenarios since we need to reevaluate the
whole training dataset each time we take one step towards the global minimum.


A popular alternative to the batch gradient descent algorithm is stochastic gradient descent,
sometimes also called iterative or on-line gradient descent. Instead of updating the weights based on
the sum of the accumulated errors over all samples $x^{(i)}$:


$$\Delta w = \eta \sum_i \left( y^{(i)} - \phi\left(z^{(i)}\right) \right) x^{(i)},$$


we update the weights incrementally for each training sample:


$$\eta \left( y^{(i)} - \phi\left(z^{(i)}\right) \right) x^{(i)}$$
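To make the difference between the two update rules concrete, here is a minimal sketch that performs one batch update and one pass of per-sample updates on the same data. The toy data is made up, and the bias unit is left out for brevity; both are simplifications for illustration rather than part of the chapter's implementation.

```python
import numpy as np

# Hypothetical toy data: 3 samples, 2 features, targets in {-1, 1}.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 3.0]])
y = np.array([1.0, -1.0, 1.0])
eta = 0.01

# Batch gradient descent: a single update from the sum over all samples.
w_batch = np.zeros(X.shape[1])
errors = y - X.dot(w_batch)
w_batch += eta * X.T.dot(errors)

# Stochastic gradient descent: one update per sample, so every later
# sample is evaluated with weights that have already been adjusted.
w_sgd = np.zeros(X.shape[1])
for xi, target in zip(X, y):
    error = target - xi.dot(w_sgd)
    w_sgd += eta * error * xi

print(w_batch)
print(w_sgd)
```

After one pass the two weight vectors are close but not identical, because the stochastic version reuses the partially updated weights within the epoch.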


Although stochastic gradient descent can be considered as an approximation of gradient descent, it
typically reaches convergence much faster because of the more frequent weight updates. Since each
gradient is calculated based on a single training example, the error surface is noisier than in gradient
descent, which can also have the advantage that stochastic gradient descent can escape shallow local
minima more readily. To obtain accurate results via stochastic gradient descent, it is important to
present it with data in a random order, which is why we want to shuffle the training set for every
epoch to prevent cycles.


Note 


In stochastic gradient descent implementations, the fixed learning rate $\eta$ is often replaced by an
adaptive learning rate that decreases over time, for example,


$$\frac{c_1}{[\text{number of iterations}] + c_2}$$


where $c_1$ and $c_2$ are constants. Note that stochastic gradient descent does not reach the global
minimum but an area very close to it. By using an adaptive learning rate, we can anneal the updates
further and get closer to the global minimum.
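A decaying learning rate of this form could be sketched as follows. The function name adaptive_eta and the constant values c1=1.0 and c2=10.0 are hypothetical choices for illustration; suitable constants depend on the problem.

```python
def adaptive_eta(iteration, c1=1.0, c2=10.0):
    """Return c1 / (iteration + c2); c1 and c2 are illustrative constants."""
    return c1 / (iteration + c2)


# Learning rates for the first five iterations: they shrink monotonically.
etas = [adaptive_eta(t) for t in range(5)]
for t, eta in enumerate(etas):
    print('iteration %d: eta = %.4f' % (t, eta))
```

In an SGD loop, such a function would be called once per update (or once per epoch) in place of the fixed self.eta.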


Another advantage of stochastic gradient descent is that we can use it for online learning. In online
learning, our model is trained on-the-fly as new training data arrives. This is especially useful if we
are accumulating large amounts of data, for example, customer data in typical web applications.
Using online learning, the system can immediately adapt to changes and the training data can be
discarded after updating the model if storage space is an issue.


Note 


A compromise between batch gradient descent and stochastic gradient descent is the so-called
mini-batch learning. Mini-batch learning can be understood as applying batch gradient descent to smaller
subsets of the training data, for example, 50 samples at a time. The advantage over batch gradient
descent is that convergence is reached faster via mini-batches because of the more frequent weight
updates. Furthermore, mini-batch learning allows us to replace the for-loop over the training samples
in Stochastic Gradient Descent (SGD) by vectorized operations, which can further improve the
computational efficiency of our learning algorithm.
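A mini-batch variant of the Adaline update could look like the following sketch. The helper name minibatch_epoch, the synthetic data, and the chosen constants are assumptions made for illustration; the weight layout (w[0] as bias unit, w[1:] as feature weights) follows the Adaline classes in this chapter.

```python
import numpy as np


def minibatch_epoch(X, y, w, eta=0.01, batch_size=50):
    """One pass over the data, updating w once per mini-batch."""
    for start in range(0, X.shape[0], batch_size):
        X_batch = X[start:start + batch_size]
        y_batch = y[start:start + batch_size]
        # Vectorized Adaline update restricted to the current mini-batch
        errors = y_batch - (X_batch.dot(w[1:]) + w[0])
        w[1:] += eta * X_batch.T.dot(errors)
        w[0] += eta * errors.sum()
    return w


def sse_cost(X, y, w):
    """Sum-of-squared-errors cost over the full dataset."""
    errors = y - (X.dot(w[1:]) + w[0])
    return (errors ** 2).sum() / 2.0


# Hypothetical synthetic data: 200 samples, 2 features, labels in {-1, 1}.
rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = np.where(X[:, 0] + X[:, 1] >= 0.0, 1, -1)

w = np.zeros(1 + X.shape[1])
initial_cost = sse_cost(X, y, w)
for epoch in range(10):
    w = minibatch_epoch(X, y, w)
final_cost = sse_cost(X, y, w)

print(initial_cost, final_cost)
```

With 200 samples and a batch size of 50, the weights are updated four times per epoch instead of once (batch) or 200 times (stochastic).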


Since we already implemented the Adaline learning rule using gradient descent, we only need to
make a few adjustments to modify the learning algorithm to update the weights via stochastic gradient
descent. Inside the fit method, we will now update the weights after each training sample.
Furthermore, we will implement an additional partial_fit method, which does not reinitialize the
weights, for on-line learning. In order to check if our algorithm converged after training, we will
calculate the cost as the average cost of the training samples in each epoch. Furthermore, we will add
an option to shuffle the training data before each epoch to avoid cycles when we are optimizing the
cost function; via the random_state parameter, we allow the specification of a random seed for
consistency:


import numpy as np
from numpy.random import seed


class AdalineSGD(object):
    """ADAptive LInear NEuron classifier.

    Parameters
    ------------
    eta : float
        Learning rate (between 0.0 and 1.0)
    n_iter : int
        Passes over the training dataset.

    Attributes
    -----------
    w_ : 1d-array
        Weights after fitting.
    cost_ : list
        Average cost of the training samples in every epoch.
    shuffle : bool (default: True)
        Shuffles training data every epoch
        if True to prevent cycles.
    random_state : int (default: None)
        Set random state for shuffling
        and initializing the weights.

    """
    def __init__(self, eta=0.01, n_iter=10,
                 shuffle=True, random_state=None):
        self.eta = eta
        self.n_iter = n_iter
        self.w_initialized = False
        self.shuffle = shuffle
        if random_state:
            seed(random_state)

    def fit(self, X, y):
        """ Fit training data.

        Parameters
        ----------
        X : {array-like}, shape = [n_samples, n_features]
            Training vectors, where n_samples
            is the number of samples and
            n_features is the number of features.
        y : array-like, shape = [n_samples]
            Target values.

        Returns
        -------
        self : object

        """
        self._initialize_weights(X.shape[1])
        self.cost_ = []
        for i in range(self.n_iter):
            if self.shuffle:
                X, y = self._shuffle(X, y)
            cost = []
            # Update the weights after each individual training sample
            for xi, target in zip(X, y):
                cost.append(self._update_weights(xi, target))
            avg_cost = sum(cost)/len(y)
            self.cost_.append(avg_cost)
        return self

    def partial_fit(self, X, y):
        """Fit training data without reinitializing the weights"""
        if not self.w_initialized:
            self._initialize_weights(X.shape[1])
        if y.ravel().shape[0] > 1:
            for xi, target in zip(X, y):
                self._update_weights(xi, target)
        else:
            self._update_weights(X, y)
        return self

    def _shuffle(self, X, y):
        """Shuffle training data"""
        r = np.random.permutation(len(y))
        return X[r], y[r]

    def _initialize_weights(self, m):
        """Initialize weights to zeros"""
        self.w_ = np.zeros(1 + m)
        self.w_initialized = True

    def _update_weights(self, xi, target):
        """Apply Adaline learning rule to update the weights"""
        output = self.net_input(xi)
        error = (target - output)
        self.w_[1:] += self.eta * xi.dot(error)
        self.w_[0] += self.eta * error
        cost = 0.5 * error**2
        return cost

    def net_input(self, X):
        """Calculate net input"""
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def activation(self, X):
        """Compute linear activation"""
        return self.net_input(X)

    def predict(self, X):
        """Return class label after unit step"""
        return np.where(self.activation(X) >= 0.0, 1, -1)


The _shuffle method that we are now using in the AdalineSGD classifier works as follows: via the
permutation function in numpy.random, we generate a random sequence of unique numbers in the
range 0 to 99 (one index per training sample). Those numbers can then be used as indices to shuffle
our feature matrix and class label vector.
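The index-based shuffle can be tried on a tiny example. In the sketch below, the toy arrays are made up so that row i of X is easy to match with label y[i]; a local RandomState with a fixed seed is used only for reproducibility.

```python
import numpy as np

rng = np.random.RandomState(1)  # fixed seed, chosen only for reproducibility

# Hypothetical toy data: the first column of X equals the label in y.
X = np.array([[0, 0], [1, 10], [2, 20], [3, 30]])
y = np.array([0, 1, 2, 3])

r = rng.permutation(len(y))  # a random ordering of the indices 0..3
X_shuffled, y_shuffled = X[r], y[r]

print(r)
print(X_shuffled)
print(y_shuffled)
```

Because the same index array r is applied to both X and y, every sample keeps its own label after shuffling.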


We can then use the fit method to train the AdalineSGD classifier and use our
plot_decision_regions to plot our training results:


>>> ada = AdalineSGD(n_iter=15, eta=0.01, random_state=1)
>>> ada.fit(X_std, y)
>>> plot_decision_regions(X_std, y, classifier=ada)
>>> plt.title('Adaline - Stochastic Gradient Descent')
>>> plt.xlabel('sepal length [standardized]')
>>> plt.ylabel('petal length [standardized]')
>>> plt.legend(loc='upper left')
>>> plt.show()
>>> plt.plot(range(1, len(ada.cost_) + 1), ada.cost_, marker='o')
>>> plt.xlabel('Epochs')
>>> plt.ylabel('Average Cost')
>>> plt.show()


The two plots that we obtain from executing the preceding code example are shown in the following
figure:





[Figure: two panels. Left: "Adaline - Stochastic Gradient Descent" decision regions, sepal length [standardized] versus petal length [standardized]. Right: Average Cost versus Epochs.]





As we can see, the average cost goes down pretty quickly, and the final decision boundary after 15
epochs looks similar to the batch gradient descent with Adaline. If we want to update our model, for
example, in an on-line learning scenario with streaming data, we could simply call the
partial_fit method on individual samples, for instance, ada.partial_fit(X_std[0, :], y[0]).

Summary


In this chapter, we gained a good understanding of the basic concepts of linear classifiers for
supervised learning. After we implemented a perceptron, we saw how we can train adaptive linear
neurons efficiently via a vectorized implementation of gradient descent and on-line learning via
stochastic gradient descent. Now that we have seen how to implement simple classifiers in Python,
we are ready to move on to the next chapter where we will use the Python scikit-learn machine
learning library to get access to more advanced and powerful off-the-shelf machine learning
classifiers that are commonly used in academia as well as in industry.

Chapter 3. A Tour of Machine Learning Classifiers Using Scikit-learn


In this chapter, we will take a tour through a selection of popular and powerful machine learning
algorithms that are commonly used in academia as well as in the industry. While learning about the
differences between several supervised learning algorithms for classification, we will also develop
an intuitive appreciation of their individual strengths and weaknesses. Also, we will take our first
steps with the scikit-learn library, which offers a user-friendly interface for using those algorithms
efficiently and productively.


The topics that we will learn about throughout this chapter are as follows: 


• Introduction to the concepts of popular classification algorithms
• Using the scikit-learn machine learning library
• Questions to ask when selecting a machine learning algorithm

Choosing a classification algorithm


Choosing an appropriate classification algorithm for a particular problem task requires practice: each
algorithm has its own quirks and is based on certain assumptions. To restate the "No Free Lunch"
theorem: no single classifier works best across all possible scenarios. In practice, it is always
recommended that you compare the performance of at least a handful of different learning algorithms
to select the best model for the particular problem; problems may differ in the number of features or
samples, the amount of noise in a dataset, and whether the classes are linearly separable or not.


Eventually, the performance of a classifier, computational power as well as predictive power, 
depends heavily on the underlying data that are available for learning. The five main steps that are 
involved in training a machine learning algorithm can be summarized as follows: 


1. Selection of features.
2. Choosing a performance metric.
3. Choosing a classifier and optimization algorithm.
4. Evaluating the performance of the model.
5. Tuning the algorithm.


Since the approach of this book is to build machine learning knowledge step by step, we will mainly
focus on the principal concepts of the different algorithms in this chapter and revisit topics such as
feature selection and preprocessing, performance metrics, and hyperparameter tuning for more
detailed discussions later in this book.

First steps with scikit-learn


In Chapter 2, Training Machine Learning Algorithms for Classification, you learned about two
related learning algorithms for classification: the perceptron rule and Adaline, which we
implemented in Python by ourselves. Now we will take a look at the scikit-learn API, which
combines a user-friendly interface with a highly optimized implementation of several classification
algorithms. However, the scikit-learn library offers not only a large variety of learning algorithms, but
also many convenient functions to preprocess data and to fine-tune and evaluate our models. We will
discuss this in more detail together with the underlying concepts in Chapter 4, Building Good
Training Sets – Data Preprocessing, and Chapter 5, Compressing Data via Dimensionality
Reduction.

Training a perceptron via scikit-learn


To get started with the scikit-learn library, we will train a perceptron model similar to the one that we
implemented in Chapter 2, Training Machine Learning Algorithms for Classification. For
simplicity, we will use the already familiar Iris dataset throughout the following sections.
Conveniently, the Iris dataset is already available via scikit-learn, since it is a simple yet popular
dataset that is frequently used for testing and experimenting with algorithms. Also, we will only use
two features from the Iris flower dataset for visualization purposes.


We will assign the petal length and petal width of the 150 flower samples to the feature matrix X and
the corresponding class labels of the flower species to the vector y:


>>> from sklearn import datasets
>>> import numpy as np
>>> iris = datasets.load_iris()
>>> X = iris.data[:, [2, 3]]
>>> y = iris.target


If we executed np.unique(y) to return the different class labels stored in iris.target, we would
see that the Iris flower class names, Iris-Setosa, Iris-Versicolor, and Iris-Virginica, are already
stored as integers (0, 1, 2), which is recommended for the optimal performance of many machine
learning libraries.


To evaluate how well a trained model performs on unseen data, we will further split the dataset into 
separate training and test datasets. Later in Chapter 5, Compressing Data via Dimensionality 
Reduction, we will discuss the best practices around model evaluation in more detail: 


>>> from sklearn.cross_validation import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.3, random_state=0)


Using the train_test_split function from scikit-learn's cross_validation module, we randomly
split the X and y arrays into 30 percent test data (45 samples) and 70 percent training data (105
samples).


Many machine learning and optimization algorithms also require feature scaling for optimal 
performance, as we remember from the gradient descent example in Chapter 2, Training Machine 
Learning Algorithms for Classification. Here, we will standardize the features using the 
StandardScaler class from scikit-learn's preprocessing module: 


>>> from sklearn.preprocessing import StandardScaler
>>> sc = StandardScaler()
>>> sc.fit(X_train)
>>> X_train_std = sc.transform(X_train)
>>> X_test_std = sc.transform(X_test)


Using the preceding code, we loaded the StandardScaler class from the preprocessing module and
initialized a new StandardScaler object that we assigned to the variable sc. Using the fit method,
StandardScaler estimated the parameters $\mu$ (sample mean) and $\sigma$ (standard deviation) for each
feature dimension from the training data. By calling the transform method, we then standardized the
training data using those estimated parameters $\mu$ and $\sigma$. Note that we used the same scaling
parameters to standardize the test set so that both the values in the training and test dataset are
comparable to each other.
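The idea of estimating the scaling parameters on the training data only and then reusing them for the test data can be sketched in plain NumPy. The arrays below are made-up toy values, and the two commented steps mirror the fit and transform calls described above rather than reproducing scikit-learn's internals.

```python
import numpy as np

# Hypothetical toy split; the values are made up for illustration.
X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

# "fit": estimate the parameters from the training data only.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# "transform": apply the same parameters to both datasets.
X_train_std = (X_train - mu) / sigma
X_test_std = (X_test - mu) / sigma

print(X_train_std.mean(axis=0))
print(X_test_std)
```

The test sample is expressed in units of the training set's standard deviation, so its second feature comes out well above 1 even though the training features themselves have unit spread.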


Having standardized the training data, we can now train a perceptron model. Most algorithms in 
scikit-learn already support multiclass classification by default via the One-vs.-Rest (OvR) method, 
which allows us to feed the three flower classes to the perceptron all at once. The code is as follows: 


>>> from sklearn.linear_model import Perceptron
>>> ppn = Perceptron(n_iter=40, eta0=0.1, random_state=0)
>>> ppn.fit(X_train_std, y_train)


The scikit-learn interface reminds us of our perceptron implementation in Chapter 2, Training
Machine Learning Algorithms for Classification: after loading the Perceptron class from the
linear_model module, we initialized a new Perceptron object and trained the model via the fit
method. Here, the model parameter eta0 is equivalent to the learning rate eta that we used in our
own perceptron implementation, and the parameter n_iter defines the number of epochs (passes
over the training set). As we remember from Chapter 2, Training Machine Learning Algorithms for
Classification, finding an appropriate learning rate requires some experimentation. If the learning
rate is too large, the algorithm will overshoot the global cost minimum. If the learning rate is too
small, the algorithm requires more epochs until convergence, which can make the learning slow,
especially for large datasets. Also, we used the random_state parameter for reproducibility of the
shuffling of the training dataset after each epoch.


Having trained a model in scikit-learn, we can make predictions via the predict method, just like in
our own perceptron implementation in Chapter 2, Training Machine Learning Algorithms for
Classification. The code is as follows:


>>> y_pred = ppn.predict(X_test_std)
>>> print('Misclassified samples: %d' % (y_test != y_pred).sum())
Misclassified samples: 4


On executing the preceding code, we see that the perceptron misclassifies 4 out of the 45 flower
samples. Thus, the misclassification error on the test dataset is 0.089 or 8.9 percent
($4/45 \approx 0.089$).


Note 


Instead of the misclassification error, many machine learning practitioners report the classification
accuracy of a model, which is simply calculated as follows:
1 - misclassification error = 0.911 or 91.1 percent.

Scikit-learn also implements a large variety of different performance metrics that are available via
the metrics module. For example, we can calculate the classification accuracy of the perceptron on
the test set as follows:


>>> from sklearn.metrics import accuracy_score
>>> print('Accuracy: %.2f' % accuracy_score(y_test, y_pred))
Accuracy: 0.91
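The error and accuracy figures reported for the perceptron can also be reproduced by hand from the two counts given in the text (4 misclassified samples out of 45 test samples). The variable names in this small sketch are chosen for illustration only.

```python
n_test = 45           # number of test samples
n_misclassified = 4   # misclassified samples reported for the perceptron

error = n_misclassified / n_test
accuracy = 1.0 - error

print('Misclassification error: %.3f' % error)
print('Accuracy: %.3f' % accuracy)
```

Rounded to two decimals, the accuracy matches the 0.91 value printed via accuracy_score.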


Here, y_test are the true class labels and y_pred are the class labels that we predicted previously.


Note 


Note that we evaluate the performance of our models based on the test set in this chapter. In Chapter
5, Compressing Data via Dimensionality Reduction, you will learn about useful techniques,
including graphical analysis such as learning curves, to detect and prevent overfitting. Overfitting
means that the model captures the patterns in the training data well, but fails to generalize well to
unseen data.


Finally, we can use our plot_decision_regions function from Chapter 2, Training Machine
Learning Algorithms for Classification, to plot the decision regions of our newly trained perceptron
model and visualize how well it separates the different flower samples. However, let's add a small
modification to highlight the samples from the test dataset via small circles:


from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt
import numpy as np


def plot_decision_regions(X, y, classifier,
                          test_idx=None, resolution=0.02):

    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # plot all samples, one marker and color per class
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=cmap(idx),
                    marker=markers[idx], label=cl)

    # highlight test samples
    if test_idx:
        X_test, y_test = X[test_idx, :], y[test_idx]
        plt.scatter(X_test[:, 0], X_test[:, 1], c='',
                    alpha=1.0, linewidth=1, marker='o',
                    s=55, label='test set')


With the slight modification that we made to the plot_decision_regions function (highlighted in
the preceding code), we can now specify the indices of the samples that we want to mark on the
resulting plots. The code is as follows:


>>> X_combined_std = np.vstack((X_train_std, X_test_std))
>>> y_combined = np.hstack((y_train, y_test))
>>> plot_decision_regions(X=X_combined_std,
...                       y=y_combined,
...                       classifier=ppn,
...                       test_idx=range(105, 150))
>>> plt.xlabel('petal length [standardized]')
>>> plt.ylabel('petal width [standardized]')
>>> plt.legend(loc='upper left')
>>> plt.show()


As we can see in the resulting plot, the three flower classes cannot be perfectly separated by linear
decision boundaries:


We remember from our discussion in Chapter 2, Training Machine Learning Algorithms for
Classification, that the perceptron algorithm never converges on datasets that aren't perfectly linearly
separable, which is why the use of the perceptron algorithm is typically not recommended in practice.
In the following sections, we will look at more powerful linear classifiers that converge to a cost
minimum even if the classes are not perfectly linearly separable.


Note 


The Perceptron as well as other scikit-learn functions and classes have additional parameters that
we omit for clarity. You can read more about those parameters using the help function in Python (for
example, help(Perceptron)) or by going through the excellent scikit-learn online documentation at
http://scikit-learn.org/stable/.

Modeling class probabilities via logistic regression


Although the perceptron rule offers a nice and easygoing introduction to machine learning algorithms
for classification, its biggest disadvantage is that it never converges if the classes are not perfectly
linearly separable. The classification task in the previous section would be an example of such a
scenario. Intuitively, we can think of the reason as the weights are continuously being updated since
there is always at least one misclassified sample present in each epoch. Of course, you can change the
learning rate and increase the number of epochs, but be warned that the perceptron will never
converge on this dataset. To make better use of our time, we will now take a look at another simple
yet more powerful algorithm for linear and binary classification problems: logistic regression. Note
that, in spite of its name, logistic regression is a model for classification, not regression.

Logistic regression intuition and conditional probabilities


Logistic regression is a classification model that is very easy to implement but performs very well on
linearly separable classes. It is one of the most widely used algorithms for classification in industry.
Similar to the perceptron and Adaline, the logistic regression model in this chapter is also a linear
model for binary classification that can be extended to multiclass classification via the OvR
technique.


To explain the idea behind logistic regression as a probabilistic model, let's first introduce the odds
ratio, which is the odds in favor of a particular event. The odds ratio can be written as $\frac{p}{(1 - p)}$,
where $p$ stands for the probability of the positive event. The term positive event does not
necessarily mean good, but refers to the event that we want to predict, for example, the probability
that a patient has a certain disease; we can think of the positive event as class label $y = 1$. We can
then further define the logit function, which is simply the logarithm of the odds ratio (log-odds):


$$\text{logit}(p) = \log \frac{p}{(1 - p)}$$


The logit function takes input values in the range 0 to 1 and transforms them to values over the entire
real number range, which we can use to express a linear relationship between feature values and the
log-odds:


$$\text{logit}(p(y = 1 \mid x)) = w_0 x_0 + w_1 x_1 + \cdots + w_m x_m = \sum_{i=0}^{m} w_i x_i = w^T x$$


Here, $p(y = 1 \mid x)$ is the conditional probability that a particular sample belongs to class 1 given its
features x.


Now what we are actually interested in is predicting the probability that a certain sample belongs to a
particular class, which is the inverse form of the logit function. It is also called the logistic function,
sometimes simply abbreviated as sigmoid function due to its characteristic S-shape:


$$\phi(z) = \frac{1}{1 + e^{-z}}$$


Here, $z$ is the net input, that is, the linear combination of weights and sample features and can be
calculated as


$$z = w^T x = w_0 + w_1 x_1 + \cdots + w_m x_m$$

Now let's simply plot the sigmoid function for some values in the range -7 to 7 to see what it looks
like:


>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> def sigmoid(z):
...     return 1.0 / (1.0 + np.exp(-z))
>>> z = np.arange(-7, 7, 0.1)
>>> phi_z = sigmoid(z)
>>> plt.plot(z, phi_z)
>>> plt.axvline(0.0, color='k')
>>> plt.axhspan(0.0, 1.0, facecolor='1.0', alpha=1.0, ls='dotted')
>>> plt.axhline(y=0.5, ls='dotted', color='k')
>>> plt.yticks([0.0, 0.5, 1.0])
>>> plt.ylim(-0.1, 1.1)
>>> plt.xlabel('z')
>>> plt.ylabel(r'$\phi (z)$')
>>> plt.show()


As a result of executing the previous code example, we should now see the S-shaped (sigmoidal)
curve:


[Figure: the sigmoid function $\phi(z)$ plotted against $z$.]


We can see that $\phi(z)$ approaches 1 if $z$ goes towards infinity ($z \to \infty$), since $e^{-z}$ becomes very
small for large values of $z$. Similarly, $\phi(z)$ goes towards 0 for $z \to -\infty$ as the result of an
increasingly large denominator. Thus, we conclude that this sigmoid function takes real number values
as input and transforms them to values in the range [0, 1] with an intercept at $\phi(z) = 0.5$ (at $z = 0$).


To build some intuition for the logistic regression model, we can relate it to our previous Adaline
implementation in Chapter 2, Training Machine Learning Algorithms for Classification. In Adaline,
we used the identity function $\phi(z) = z$ as the activation function. In logistic regression, this activation
function simply becomes the sigmoid function that we defined earlier, which is illustrated in the
following figure:





[Figure: schematic of the logistic regression model, showing the net input function followed by the sigmoid function.]


The output of the sigmoid function is then interpreted as the probability of a particular sample 
belonging to class 1, $\phi(z) = P(y = 1 \mid \mathbf{x}; \mathbf{w})$, given its features x parameterized by the 
weights w. For example, if we compute $\phi(z) = 0.8$ for a particular flower sample, it means that the 
chance that this sample is an Iris-Versicolor flower is 80 percent. Similarly, the probability that this 
flower is an Iris-Setosa flower can be calculated as 
$P(y = 0 \mid \mathbf{x}; \mathbf{w}) = 1 - P(y = 1 \mid \mathbf{x}; \mathbf{w}) = 0.2$, or 20 
percent. The predicted probability can then simply be converted into a binary outcome via a quantizer 
(unit step function): 

$$\hat{y} = \begin{cases} 1 & \text{if } \phi(z) \ge 0.5 \\ 0 & \text{otherwise} \end{cases}$$

If we look at the preceding sigmoid plot, this is equivalent to the following: 

$$\hat{y} = \begin{cases} 1 & \text{if } z \ge 0.0 \\ 0 & \text{otherwise} \end{cases}$$
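
The chain from net input to class label can be sketched in a few lines of NumPy. This is only an illustration: the weight vector and the sample below are made-up values, not parameters learned from the Iris data.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps a real-valued net input to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights (w[0] plays the role of w_0) and one sample with x_0 = 1
w = np.array([-1.0, 0.8, 0.5])
x = np.array([1.0, 2.0, 1.5])

z = w.dot(x)                            # net input z = w^T x
p_class1 = sigmoid(z)                   # P(y = 1 | x; w)
p_class0 = 1.0 - p_class1               # P(y = 0 | x; w)
y_pred = 1 if p_class1 >= 0.5 else 0    # quantizer (unit step function)

print(z, p_class1, p_class0, y_pred)
```

Thresholding the probability at 0.5 gives the same label as thresholding the net input at 0, because the sigmoid crosses 0.5 exactly at z = 0.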


In fact, there are many applications where we are not only interested in the predicted class labels, but 
where estimating the class-membership probability is particularly useful. Logistic regression is used 
in weather forecasting, for example, to not only predict if it will rain on a particular day but also to 
report the chance of rain. Similarly, logistic regression can be used to predict the chance that a patient 
has a particular disease given certain symptoms, which is why logistic regression enjoys wide 
popularity in the field of medicine. 
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Learning the weights of the logistic cost function 


You learned how we could use the logistic regression model to predict probabilities and class labels. 
Now let's briefly talk about the parameters of the model, for example, the weights w. In the previous 
chapter, we defined the sum-squared-error cost function: 

$$J(\mathbf{w}) = \sum_{i} \frac{1}{2} \left( \phi(z^{(i)}) - y^{(i)} \right)^2$$


We minimized this in order to learn the weights w for our Adaline classification model. To explain 
how we can derive the cost function for logistic regression, let's first define the likelihood L that we 
want to maximize when we build a logistic regression model, assuming that the individual samples in 
our dataset are independent of one another. The formula is as follows: 


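$$L(\mathbf{w}) = P(\mathbf{y} \mid \mathbf{x}; \mathbf{w}) = \prod_{i=1}^{n} P\left(y^{(i)} \mid x^{(i)}; \mathbf{w}\right) = \prod_{i=1}^{n} \left(\phi(z^{(i)})\right)^{y^{(i)}} \left(1 - \phi(z^{(i)})\right)^{1 - y^{(i)}}$$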


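In practice, it is easier to maximize the (natural) log of this equation, which is called the 
log-likelihood function: 

$$l(\mathbf{w}) = \log L(\mathbf{w}) = \sum_{i=1}^{n} \left[ y^{(i)} \log\left(\phi(z^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - \phi(z^{(i)})\right) \right]$$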


Firstly, applying the log function reduces the potential for numerical underflow, which can occur if the 
likelihoods are very small. Secondly, we can convert the product of factors into a summation of 
factors, which makes it easier to obtain the derivative of this function via the addition trick, as you 
may remember from calculus. 
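
The underflow argument is easy to reproduce. The following sketch uses an artificial set of 2,000 per-sample likelihoods that are all 0.5 (an assumption made purely for illustration): their product is too small for a 64-bit float, while the sum of their logs is an ordinary number.

```python
import numpy as np

# Artificial per-sample likelihoods; 0.5 is an arbitrary illustrative value
probs = np.full(2000, 0.5)

likelihood = np.prod(probs)             # 0.5**2000 underflows to 0.0
log_likelihood = np.sum(np.log(probs))  # sum of logs stays representable

print(likelihood, log_likelihood)
```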


Now we could use an optimization algorithm such as gradient ascent to maximize this log-likelihood 
function. Alternatively, let's rewrite the log-likelihood as a cost function J that can be minimized 
using gradient descent as in Chapter 2, Training Machine Learning Algorithms for Classification: 

$$J(\mathbf{w}) = \sum_{i=1}^{n} \left[ -y^{(i)} \log\left(\phi(z^{(i)})\right) - \left(1 - y^{(i)}\right) \log\left(1 - \phi(z^{(i)})\right) \right]$$

To get a better grasp on this cost function, let's take a look at the cost that we calculate for one 
single-sample instance: 

$$J(\phi(z), y; \mathbf{w}) = -y \log(\phi(z)) - (1 - y) \log(1 - \phi(z))$$

Looking at the preceding equation, we can see that the first term becomes zero if y = 0, and the 
second term becomes zero if y = 1, respectively: 

$$J(\phi(z), y; \mathbf{w}) = \begin{cases} -\log(\phi(z)) & \text{if } y = 1 \\ -\log(1 - \phi(z)) & \text{if } y = 0 \end{cases}$$


The following plot illustrates the cost for the classification of a single-sample instance for different 
values of $\phi(z)$: 

We can see that the cost approaches 0 (plain blue line) if we correctly predict that a sample belongs 
to class 1. Similarly, we can see on the y axis that the cost also approaches 0 if we correctly predict 
y = 0 (dashed line). However, if the prediction is wrong, the cost goes towards infinity. The moral is 
that we penalize wrong predictions with an increasingly larger cost. 
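
To put numbers on this behavior, here is a small sketch of the single-sample cost. The function name `single_sample_cost` and the probabilities passed to it are illustrative choices, not part of the original implementation.

```python
import numpy as np

def single_sample_cost(phi_z, y):
    """Logistic cost for one sample with sigmoid output phi_z and true label y."""
    return -y * np.log(phi_z) - (1 - y) * np.log(1 - phi_z)

# True class 1: a confident correct prediction versus a confident wrong one
cost_right_class1 = single_sample_cost(0.99, 1)
cost_wrong_class1 = single_sample_cost(0.01, 1)

# True class 0: a low sigmoid output is the correct prediction
cost_right_class0 = single_sample_cost(0.01, 0)

print(cost_right_class1, cost_wrong_class1, cost_right_class0)
```

The correct predictions cost about 0.01 each, whereas the confident wrong prediction costs about 4.6, and the cost keeps growing as the sigmoid output moves further towards the wrong extreme.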
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Training a logistic regression model with scikit-learn 


If we were to implement logistic regression ourselves, we could simply substitute the cost function J 
in our Adaline implementation from Chapter 2, Training Machine Learning Algorithms for 
Classification, by the new cost function: 

$$J(\mathbf{w}) = -\sum_{i} \left[ y^{(i)} \log\left(\phi(z^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - \phi(z^{(i)})\right) \right]$$

This would compute the cost of classifying all training samples per epoch and we would end up with 
a working logistic regression model. However, since scikit-learn implements a highly optimized 
version of logistic regression that also supports multiclass settings off-the-shelf, we will skip the 
implementation and use the sklearn.linear_model.LogisticRegression class as well as the 
familiar fit method to train the model on the standardized flower training dataset: 

>>> from sklearn.linear_model import LogisticRegression 
>>> lr = LogisticRegression(C=1000.0, random_state=0) 
>>> lr.fit(X_train_std, y_train) 
>>> plot_decision_regions(X_combined_std, 
...                       y_combined, classifier=lr, 
...                       test_idx=range(105, 150)) 
>>> plt.xlabel('petal length [standardized]') 
>>> plt.ylabel('petal width [standardized]') 
>>> plt.legend(loc='upper left') 
>>> plt.show() 


After fitting the model on the training data, we plotted the decision regions, training samples and test 
samples, as shown here: 

Looking at the preceding code that we used to train the LogisticRegression model, you might now 
be wondering, "What is this mysterious parameter C?" We will get to this in a second, but let's briefly 
go over the concept of overfitting and regularization in the next subsection first. 

Furthermore, we can predict the class-membership probability of the samples via the 
predict_proba method. For example, we can predict the probabilities of the first sample in the test 
set (the reshape call passes it as a two-dimensional array with a single row): 

>>> lr.predict_proba(X_test_std[0, :].reshape(1, -1)) 

This returns the following array: 

array([[ 0.000,  0.063,  0.937]]) 

The preceding array tells us that the model predicts a chance of 93.7 percent that the sample belongs 
to the Iris-Virginica class, and a 6.3 percent chance that the sample is an Iris-Versicolor flower. 


We can show that the weight update in logistic regression via gradient descent is indeed equal to the 
equation that we used in Adaline in Chapter 2, Training Machine Learning Algorithms for 
Classification. Let's start by calculating the partial derivative of the log-likelihood function with 
respect to the jth weight: 

$$\frac{\partial}{\partial w_j} l(\mathbf{w}) = \left( y \frac{1}{\phi(z)} - (1 - y) \frac{1}{1 - \phi(z)} \right) \frac{\partial}{\partial w_j} \phi(z)$$


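Before we continue, let's calculate the partial derivative of the sigmoid function first: 

$$\frac{\partial}{\partial z} \phi(z) = \frac{\partial}{\partial z} \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{\left(1 + e^{-z}\right)^2} = \frac{1}{1 + e^{-z}} \left( 1 - \frac{1}{1 + e^{-z}} \right) = \phi(z)\left(1 - \phi(z)\right)$$

Now we can resubstitute $\frac{\partial}{\partial z} \phi(z) = \phi(z)(1 - \phi(z))$ in our first equation to obtain the following: 

$$\left( y \frac{1}{\phi(z)} - (1 - y) \frac{1}{1 - \phi(z)} \right) \frac{\partial}{\partial w_j} \phi(z)$$

$$= \left( y \frac{1}{\phi(z)} - (1 - y) \frac{1}{1 - \phi(z)} \right) \phi(z)\left(1 - \phi(z)\right) \frac{\partial}{\partial w_j} z$$

$$= \left( y \left(1 - \phi(z)\right) - (1 - y) \phi(z) \right) x_j$$

$$= \left( y - \phi(z) \right) x_j$$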


Remember that the goal is to find the weights that maximize the log-likelihood so that we would 
perform the update for each weight as follows: 

$$w_j := w_j + \eta \sum_{i=1}^{n} \left( y^{(i)} - \phi(z^{(i)}) \right) x_j^{(i)}$$
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Since we update all weights simultaneously, we can write the general update rule as follows: 


$$\mathbf{w} := \mathbf{w} + \Delta \mathbf{w}$$

We define $\Delta \mathbf{w}$ as follows: 

$$\Delta \mathbf{w} = \eta \nabla l(\mathbf{w})$$


Since maximizing the log-likelihood is equal to minimizing the cost function J that we defined 
earlier, we can write the gradient descent update rule as follows: 

$$\Delta w_j = -\eta \frac{\partial J}{\partial w_j} = \eta \sum_{i=1}^{n} \left( y^{(i)} - \phi(z^{(i)}) \right) x_j^{(i)}$$

$$\mathbf{w} := \mathbf{w} + \Delta \mathbf{w}, \quad \Delta \mathbf{w} = -\eta \nabla J(\mathbf{w})$$

This is equal to the gradient descent rule in Adaline in Chapter 2, Training Machine Learning 
Algorithms for Classification. 
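
One way to gain confidence in this derivation is to compare the analytic gradient $(y - \phi(z)) x_j$, summed over the samples, against a finite-difference approximation of the log-likelihood. The sketch below does this on a small random dataset; the data, the seed, and the helper names are assumptions made for the check and are unrelated to the Iris example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    """l(w) = sum_i [ y log(phi) + (1 - y) log(1 - phi) ]."""
    phi = sigmoid(X.dot(w))
    return np.sum(y * np.log(phi) + (1 - y) * np.log(1 - phi))

def analytic_gradient(w, X, y):
    """Gradient from the derivation: sum_i (y_i - phi(z_i)) x_i."""
    return X.T.dot(y - sigmoid(X.dot(w)))

# Small synthetic problem; the first column of ones stands for x_0 = 1
rng = np.random.RandomState(0)
X = np.hstack([np.ones((20, 1)), rng.randn(20, 2)])
y = (rng.rand(20) > 0.5).astype(float)
w = 0.1 * rng.randn(3)

# Central finite differences, one weight at a time
eps = 1e-6
numeric = np.array([
    (log_likelihood(w + eps * e, X, y) - log_likelihood(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
analytic = analytic_gradient(w, X, y)

# One gradient ascent step with learning rate eta
eta = 0.01
w_new = w + eta * analytic
print(numeric, analytic)
```

If the two gradients agree to several decimal places, the update `w + eta * analytic` is a valid ascent step on the log-likelihood, which is the same as a descent step on the cost J.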
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Tackling overfitting via regularization 


Overfitting is a common problem in machine learning, where a model performs well on training data 
but does not generalize well to unseen data (test data). If a model suffers from overfitting, we also say 
that the model has a high variance, which can be caused by having too many parameters that lead to a 
model that is too complex given the underlying data. Similarly, our model can also suffer from 
underfitting (high bias), which means that our model is not complex enough to capture the pattern in 
the training data well and therefore also suffers from low performance on unseen data. 

Although we have only encountered linear models for classification so far, the problem of overfitting 
and underfitting can be best illustrated by using a more complex, nonlinear decision boundary as 
shown in the following figure: 

[Figure: three decision boundaries over features x1 and x2, labeled underfitting (high bias), good 
compromise, and overfitting (high variance)] 








Note 


Variance measures the consistency (or variability) of the model prediction for a particular sample 
instance if we were to retrain the model multiple times, for example, on different subsets of the training 
dataset. We can say that the model is sensitive to the randomness in the training data. In contrast, bias 
measures how far off the predictions are from the correct values in general if we rebuild the model 
multiple times on different training datasets; bias is the measure of the systematic error that is not due 
to randomness. 


One way of finding a good bias-variance tradeoff is to tune the complexity of the model via 
regularization. Regularization is a very useful method to handle collinearity (high correlation among 
features), filter out noise from data, and eventually prevent overfitting. The concept behind 
regularization is to introduce additional information (bias) to penalize extreme parameter weights. 
The most common form of regularization is the so-called L2 regularization (sometimes also called 
L2 shrinkage or weight decay), which can be written as follows: 

$$\frac{\lambda}{2} \|\mathbf{w}\|^2 = \frac{\lambda}{2} \sum_{j=1}^{m} w_j^2$$

Here, $\lambda$ is the so-called regularization parameter. 


Note 


Regularization is another reason why feature scaling such as standardization is important. For 
regularization to work properly, we need to ensure that all our features are on comparable scales. 


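In order to apply regularization, we just need to add the regularization term to the cost function that 
we defined for logistic regression to shrink the weights: 

$$J(\mathbf{w}) = \sum_{i=1}^{n} \left[ -y^{(i)} \log\left(\phi(z^{(i)})\right) - \left(1 - y^{(i)}\right) \log\left(1 - \phi(z^{(i)})\right) \right] + \frac{\lambda}{2} \|\mathbf{w}\|^2$$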











Via the regularization parameter $\lambda$, we can then control how well we fit the training data while 
keeping the weights small. By increasing the value of $\lambda$, we increase the regularization strength. 

The parameter C that is implemented for the LogisticRegression class in scikit-learn comes from a 
convention in support vector machines, which will be the topic of the next section. C is directly 
related to the regularization parameter $\lambda$, which is its inverse: 

$$C = \frac{1}{\lambda}$$

So we can rewrite the regularized cost function of logistic regression as follows: 

$$J(\mathbf{w}) = C \sum_{i=1}^{n} \left[ -y^{(i)} \log\left(\phi(z^{(i)})\right) - \left(1 - y^{(i)}\right) \log\left(1 - \phi(z^{(i)})\right) \right] + \frac{1}{2} \|\mathbf{w}\|^2$$
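
The two formulations differ only by a constant factor, so they share the same minimizer. The sketch below checks this numerically under a few assumptions: the data and weights are arbitrary illustrative values, the function names are hypothetical, and the penalty is applied to every entry of w for simplicity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(w, X, y):
    """Unregularized logistic cost summed over all samples."""
    phi = sigmoid(X.dot(w))
    return np.sum(-y * np.log(phi) - (1 - y) * np.log(1 - phi))

def cost_with_lambda(w, X, y, lam):
    """Cost plus (lambda / 2) * ||w||^2."""
    return logistic_cost(w, X, y) + 0.5 * lam * np.sum(w ** 2)

def cost_with_C(w, X, y, C):
    """C times the cost plus (1 / 2) * ||w||^2."""
    return C * logistic_cost(w, X, y) + 0.5 * np.sum(w ** 2)

# Arbitrary illustrative data and weights
X = np.array([[1.0, 0.5, -1.2], [1.0, -0.3, 0.8], [1.0, 1.5, 0.1]])
y = np.array([1.0, 0.0, 1.0])
w = np.array([0.2, -0.4, 0.7])

lam = 0.1
C = 1.0 / lam
j_lambda = cost_with_lambda(w, X, y, lam)
j_C = cost_with_C(w, X, y, C)
print(j_lambda, j_C, j_C * lam)
```

Multiplying the C form by $\lambda$ recovers the $\lambda$ form, which is why a small C corresponds to strong regularization.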








Consequently, decreasing the value of the inverse regularization parameter C means that we are 
increasing the regularization strength, which we can visualize by plotting the L2 regularization path 
for the two weight coefficients: 

>>> weights, params = [], [] 
>>> for c in np.arange(-5, 5): 
...     # 10.0 (a float) avoids NumPy's error for negative integer powers 
...     lr = LogisticRegression(C=10.0**c, random_state=0) 
...     lr.fit(X_train_std, y_train) 
...     weights.append(lr.coef_[1]) 
...     params.append(10.0**c) 
>>> weights = np.array(weights) 
>>> plt.plot(params, weights[:, 0], 
...          label='petal length') 
>>> plt.plot(params, weights[:, 1], linestyle='--', 
...          label='petal width') 
>>> plt.ylabel('weight coefficient') 
>>> plt.xlabel('C') 
>>> plt.legend(loc='upper left') 
>>> plt.xscale('log') 
>>> plt.show() 


By executing the preceding code, we fitted ten logistic regression models with different values for the 
inverse-regularization parameter C. For the purposes of illustration, we only collected the weight 
coefficients of the class 2 vs. all classifier. Remember that we are using the OvR technique for 
multiclass classification. 

As we can see in the resulting plot, the weight coefficients shrink if we decrease the parameter C, that 
is, if we increase the regularization strength: 

[Figure: weight coefficient plotted against C on a log scale for petal length and petal width] 





Note 


Since an in-depth coverage of the individual classification algorithms exceeds the scope of this book, 
I warmly recommend Dr. Scott Menard's Logistic Regression: From Introductory to Advanced 
Concepts and Applications, Sage Publications, to readers who want to learn more about logistic 
regression. 
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Maximum margin classification with support 
vector machines 


Another powerful and widely used learning algorithm is the support vector machine (SVM), which 
can be considered an extension of the perceptron. Using the perceptron algorithm, we minimized 
misclassification errors. However, in SVMs, our optimization objective is to maximize the margin. 
The margin is defined as the distance between the separating hyperplane (decision boundary) and the 
training samples that are closest to this hyperplane, which are the so-called support vectors. This is 
illustrated in the following figure: 

[Figure: left panel, "Which hyperplane?"; right panel, "SVM: Maximize the margin", showing the 
support vectors, the decision boundary w^T x = 0, the "positive" hyperplane w^T x = 1, and the 
"negative" hyperplane w^T x = -1] 
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Maximum margin intuition 


The rationale behind having decision boundaries with large margins is that they tend to have a lower 
generalization error whereas models with small margins are more prone to overfitting. To get an 
intuition for the margin maximization, let's take a closer look at those positive and negative 
hyperplanes that are parallel to the decision boundary, which can be expressed as follows: 


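$$w_0 + \mathbf{w}^T \mathbf{x}_{pos} = 1 \quad (1)$$

$$w_0 + \mathbf{w}^T \mathbf{x}_{neg} = -1 \quad (2)$$

If we subtract those two linear equations (1) and (2) from each other, we get: 

$$\Rightarrow \mathbf{w}^T \left( \mathbf{x}_{pos} - \mathbf{x}_{neg} \right) = 2$$

We can normalize this by the length of the vector w, which is defined as follows: 

$$\|\mathbf{w}\| = \sqrt{\sum_{j=1}^{m} w_j^2}$$

So we arrive at the following equation: 

$$\frac{\mathbf{w}^T \left( \mathbf{x}_{pos} - \mathbf{x}_{neg} \right)}{\|\mathbf{w}\|} = \frac{2}{\|\mathbf{w}\|}$$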


The left side of the preceding equation can then be interpreted as the distance between the positive 
and negative hyperplane, which is the so-called margin that we want to maximize. 

Now the objective function of the SVM becomes the maximization of this margin by maximizing 
$\frac{2}{\|\mathbf{w}\|}$ under the constraint that the samples are classified correctly, which can be written as follows: 

$$w_0 + \mathbf{w}^T \mathbf{x}^{(i)} \ge 1 \quad \text{if } y^{(i)} = 1$$

$$w_0 + \mathbf{w}^T \mathbf{x}^{(i)} \le -1 \quad \text{if } y^{(i)} = -1$$

These two equations basically say that all negative samples should fall on one side of the negative 
hyperplane, whereas all the positive samples should fall behind the positive hyperplane. This can 
also be written more compactly as follows: 

$$y^{(i)} \left( w_0 + \mathbf{w}^T \mathbf{x}^{(i)} \right) \ge 1 \quad \forall i$$

In practice, though, it is easier to minimize the reciprocal term $\frac{1}{2} \|\mathbf{w}\|^2$, which can be solved by 
quadratic programming. A detailed discussion about quadratic programming is beyond the 
scope of this book, but if you are interested, you can learn more about Support Vector Machines 
(SVM) in Vladimir Vapnik's The Nature of Statistical Learning Theory, Springer Science & 
Business Media, or Chris J.C. Burges' excellent explanation in A Tutorial on Support Vector 
Machines for Pattern Recognition (Data mining and knowledge discovery, 2(2):121-167, 1998). 
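
As a small numeric illustration of these quantities, the sketch below takes a hand-picked weight vector and four toy samples (all of them assumptions, not the result of training an SVM), computes the margin $2 / \|\mathbf{w}\|$, and checks the compact constraint for each sample.

```python
import numpy as np

# Hand-picked hyperplane parameters, for illustration only
w0 = 0.0
w = np.array([1.0, 1.0])

# Toy samples with labels +1 and -1
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]])
y = np.array([1, 1, -1, -1])

margin = 2.0 / np.linalg.norm(w)         # distance between the two hyperplanes
functional = y * (w0 + X.dot(w))         # y_i (w_0 + w^T x_i) for every sample
constraints_ok = np.all(functional >= 1.0)

print(margin, functional, constraints_ok)
```

Scaling w down widens the margin, but only as long as every sample still satisfies the constraint, which is exactly the trade-off the quadratic program resolves.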
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Dealing with the nonlinearly separable case using slack 
variables 


Although we don't want to dive much deeper into the more involved mathematical concepts behind the 
margin classification, let's briefly mention the slack variable $\xi$. It was introduced by Vladimir 
Vapnik in 1995 and led to the so-called soft-margin classification. The motivation for introducing the 
slack variable $\xi$ was that the linear constraints need to be relaxed for nonlinearly separable data to 
allow convergence of the optimization in the presence of misclassifications under the appropriate 
cost penalization. 

The positive-valued slack variable is simply added to the linear constraints: 

$$\mathbf{w}^T \mathbf{x}^{(i)} \ge 1 - \xi^{(i)} \quad \text{if } y^{(i)} = 1$$

$$\mathbf{w}^T \mathbf{x}^{(i)} \le -1 + \xi^{(i)} \quad \text{if } y^{(i)} = -1$$

So the new objective to be minimized (subject to the preceding constraints) becomes: 

$$\frac{1}{2} \|\mathbf{w}\|^2 + C \left( \sum_{i} \xi^{(i)} \right)$$


Using the variable C, we can then control the penalty for misclassification. Large values of C 
correspond to large error penalties whereas we are less strict about misclassification errors if we 
choose smaller values for C. We can then use the parameter C to control the width of the margin 
and therefore tune the bias-variance trade-off as illustrated in the following figure: 

[Figure: two panels comparing the decision boundary and margin for a large value of parameter C and 
for a small value of parameter C] 

This concept is related to regularization, which we discussed previously in the context of regularized 
regression where decreasing the value of C increases the bias and lowers the variance of the model. 
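
The role of the slack variables and of C can be made concrete with a short sketch. It assumes a fixed, hand-picked weight vector and takes the smallest slack that satisfies each relaxed constraint, $\xi^{(i)} = \max(0, 1 - y^{(i)} \mathbf{w}^T \mathbf{x}^{(i)})$; this closed form is a standard way to read the constraints and is not stated in the text above.

```python
import numpy as np

# Hand-picked weights and toy samples (illustrative values only)
w = np.array([1.0, 1.0])
X = np.array([[2.0, 1.0], [0.2, 0.3], [-1.0, -2.0], [0.5, 0.1]])
y = np.array([1, 1, -1, -1])

# Smallest slack per sample that satisfies the relaxed constraint
functional = y * X.dot(w)
slack = np.maximum(0.0, 1.0 - functional)

def soft_margin_objective(w, slack, C):
    """(1/2) ||w||^2 + C * sum of slack variables."""
    return 0.5 * w.dot(w) + C * slack.sum()

objective_small_C = soft_margin_objective(w, slack, C=1.0)
objective_large_C = soft_margin_objective(w, slack, C=100.0)
print(slack, objective_small_C, objective_large_C)
```

Samples well outside the margin get zero slack, the sample inside the margin gets a slack of 0.5, and the misclassified sample gets a slack above 1. A larger C makes the same violations far more expensive.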


Now that we learned the basic concepts behind the linear SVM, let's train an SVM model to classify 
the different flowers in our Iris dataset: 

>>> from sklearn.svm import SVC 
>>> svm = SVC(kernel='linear', C=1.0, random_state=0) 
>>> svm.fit(X_train_std, y_train) 
>>> plot_decision_regions(X_combined_std, 
...                       y_combined, classifier=svm, 
...                       test_idx=range(105, 150)) 
>>> plt.xlabel('petal length [standardized]') 
>>> plt.ylabel('petal width [standardized]') 
>>> plt.legend(loc='upper left') 
>>> plt.show() 


The decision regions of the SVM visualized after executing the preceding code example are shown in 
the following plot: 

[Figure: linear SVM decision regions for classes 0, 1, and 2 with the test set highlighted; axes: petal 
length [standardized] and petal width [standardized]] 




Note 
Logistic regression versus SVM 


In practical classification tasks, linear logistic regression and linear SVMs often yield very similar 
results. Logistic regression tries to maximize the conditional likelihoods of the training data, which 
makes it more prone to outliers than SVMs. The SVMs mostly care about the points that are closest to 
the decision boundary (support vectors). On the other hand, logistic regression has the advantage that 
it is a simpler model that can be implemented more easily. Furthermore, logistic regression models 
can be easily updated, which is attractive when working with streaming data. 
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Alternative implementations in scikit-learn 


The Perceptron and LogisticRegression classes that we used in the previous sections via 
scikit-learn make use of the LIBLINEAR library, which is a highly optimized C/C++ library developed at 
the National Taiwan University (http://www.csie.ntu.edu.tw/~cjlin/liblinear/). Similarly, the SVC 
class that we used to train an SVM makes use of LIBSVM, which is an equivalent C/C++ library 
specialized for SVMs (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). 

The advantage of using LIBLINEAR and LIBSVM over native Python implementations is that they 
allow an extremely quick training of large amounts of linear classifiers. However, sometimes our 
datasets are too large to fit into computer memory. Thus, scikit-learn also offers alternative 
implementations via the SGDClassifier class, which also supports online learning via the 
partial_fit method. The concept behind the SGDClassifier class is similar to the stochastic 
gradient algorithm that we implemented in Chapter 2, Training Machine Learning Algorithms for 
Classification, for Adaline. We could initialize the stochastic gradient descent version of the 
perceptron, logistic regression, and support vector machine with default parameters as follows: 


>>> from sklearn.linear_model import SGDClassifier 
>>> ppn = SGDClassifier(loss='perceptron') 

>>> lr = SGDClassifier(loss='log') 

>>> svm = SGDClassifier(loss='hinge') 
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Solving nonlinear problems using a kernel 
SVM 


Another reason why SVMs enjoy high popularity among machine learning practitioners is that they 
can be easily kernelized to solve nonlinear classification problems. Before we discuss the main 
concept behind kernel SVM, let's first define and create a sample dataset to see how such a nonlinear 
classification problem may look. 


Using the following code, we will create a simple dataset that has the form of an XOR gate using the 
logical_xor function from NumPy, where 100 samples will be assigned the class label 1 and 100 
samples will be assigned the class label -1, respectively: 

>>> np.random.seed(0) 
>>> X_xor = np.random.randn(200, 2) 
>>> y_xor = np.logical_xor(X_xor[:, 0] > 0, X_xor[:, 1] > 0) 
>>> y_xor = np.where(y_xor, 1, -1) 
>>> plt.scatter(X_xor[y_xor==1, 0], X_xor[y_xor==1, 1], 
...             c='b', marker='x', label='1') 
>>> plt.scatter(X_xor[y_xor==-1, 0], X_xor[y_xor==-1, 1], 
...             c='r', marker='s', label='-1') 
>>> plt.ylim(-3.0) 
>>> plt.legend() 
>>> plt.show() 


After executing the code, we will have an XOR dataset with random noise, as shown in the following 
figure: 
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Obviously, we would not be able to separate samples from the positive and negative class very well 
using a linear hyperplane as the decision boundary via the linear logistic regression or linear SVM 
model that we discussed in earlier sections. 


The basic idea behind kernel methods to deal with such linearly inseparable data is to create 
nonlinear combinations of the original features to project them onto a higher dimensional space via a 
mapping function $\phi(\cdot)$ where it becomes linearly separable. As shown in the next figure, we can 
transform a two-dimensional dataset onto a new three-dimensional feature space where the classes 
become separable via the following projection: 

$$\phi(x_1, x_2) = (z_1, z_2, z_3) = (x_1, x_2, x_1^2 + x_2^2)$$

This allows us to separate the two classes shown in the plot via a linear hyperplane that becomes a 
nonlinear decision boundary if we project it back onto the original feature space: 
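
The effect of this projection can be sketched with two concentric rings of points, which no straight line separates in two dimensions. The ring radii, the number of points, and the threshold below are assumptions chosen to make the illustration clean.

```python
import numpy as np

# Two concentric rings: class "inner" at radius 0.5, class "outer" at radius 2.0
angles = np.linspace(0.0, 2.0 * np.pi, 50, endpoint=False)
ring = np.column_stack([np.cos(angles), np.sin(angles)])
X_inner = 0.5 * ring
X_outer = 2.0 * ring

def project(X):
    """phi(x1, x2) = (x1, x2, x1^2 + x2^2)."""
    return np.column_stack([X[:, 0], X[:, 1], X[:, 0] ** 2 + X[:, 1] ** 2])

z3_inner = project(X_inner)[:, 2]
z3_outer = project(X_outer)[:, 2]

# In the new space, the plane z3 = 1 separates the classes
threshold = 1.0
separable = np.all(z3_inner < threshold) and np.all(z3_outer > threshold)
print(z3_inner.max(), z3_outer.min(), separable)
```

Projected back onto the original two features, the plane z3 = 1 is the circle $x_1^2 + x_2^2 = 1$, that is, a nonlinear decision boundary.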
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Using the kernel trick to find separating hyperplanes in 
higher dimensional space 


To solve a nonlinear problem using an SVM, we transform the training data onto a higher dimensional 
feature space via a mapping function $\phi(\cdot)$ and train a linear SVM model to classify the data in this 
new feature space. Then we can use the same mapping function $\phi(\cdot)$ to transform new, unseen data to 
classify it using the linear SVM model. 

However, one problem with this mapping approach is that the construction of the new features is 
computationally very expensive, especially if we are dealing with high-dimensional data. This is 
where the so-called kernel trick comes into play. Although we didn't go into much detail about how to 
solve the quadratic programming task to train an SVM, in practice all we need is to replace the dot 
product $\mathbf{x}^{(i)T} \mathbf{x}^{(j)}$ by $\phi(\mathbf{x}^{(i)})^T \phi(\mathbf{x}^{(j)})$. In order to save the expensive step of calculating this dot 
product between two points explicitly, we define a so-called kernel function: 

$$k\left(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}\right) = \phi\left(\mathbf{x}^{(i)}\right)^T \phi\left(\mathbf{x}^{(j)}\right)$$

One of the most widely used kernels is the Radial Basis Function kernel (RBF kernel) or Gaussian 
kernel: 

$$k\left(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}\right) = \exp\left( -\frac{\left\| \mathbf{x}^{(i)} - \mathbf{x}^{(j)} \right\|^2}{2\sigma^2} \right)$$

This is often simplified to: 

$$k\left(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}\right) = \exp\left( -\gamma \left\| \mathbf{x}^{(i)} - \mathbf{x}^{(j)} \right\|^2 \right)$$

Here, $\gamma = \frac{1}{2\sigma^2}$ is a free parameter that is to be optimized. 

Roughly speaking, the term kernel can be interpreted as a similarity function between a pair of 
samples. The minus sign inverts the distance measure into a similarity score and, due to the 
exponential term, the resulting similarity score will fall into a range between 1 (for exactly similar 
samples) and 0 (for very dissimilar samples). 
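
A direct NumPy transcription of the simplified RBF kernel shows this similarity behavior. The helper name `rbf_kernel_value` and the sample points are illustrative assumptions.

```python
import numpy as np

def rbf_kernel_value(x_i, x_j, gamma):
    """k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return np.exp(-gamma * np.sum(diff ** 2))

x = np.array([0.0, 0.0])
k_same = rbf_kernel_value(x, x, gamma=1.0)             # identical samples
k_near = rbf_kernel_value(x, [0.1, 0.0], gamma=1.0)    # close neighbor
k_far = rbf_kernel_value(x, [10.0, 0.0], gamma=1.0)    # distant sample

print(k_same, k_near, k_far)
```

Identical samples score exactly 1, a close neighbor scores about 0.99, and a distant sample scores practically 0.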


Now that we defined the big picture behind the kernel trick, let's see if we can train a kernel SVM that 
is able to draw a nonlinear decision boundary that separates the XOR data well. Here, we simply use 
the SVC class from scikit-learn that we imported earlier and replace the parameter kernel='linear' 
with kernel='rbf': 

>>> svm = SVC(kernel='rbf', random_state=0, gamma=0.10, C=10.0) 
>>> svm.fit(X_xor, y_xor) 
>>> plot_decision_regions(X_xor, y_xor, classifier=svm) 
>>> plt.legend(loc='upper left') 
>>> plt.show() 

As we can see in the resulting plot, the kernel SVM separates the XOR data relatively well: 





The $\gamma$ parameter, which we set to gamma=0.1, can be understood as a cut-off parameter for the 
Gaussian sphere. If we increase the value for $\gamma$, we decrease the influence or reach of the training 
samples, which leads to a tighter and bumpier decision boundary. To get a better intuition for $\gamma$, let's 
apply RBF kernel SVM to our Iris flower dataset: 

>>> svm = SVC(kernel='rbf', random_state=0, gamma=0.2, C=1.0) 
>>> svm.fit(X_train_std, y_train) 
>>> plot_decision_regions(X_combined_std, 
...                       y_combined, classifier=svm, 
...                       test_idx=range(105, 150)) 
>>> plt.xlabel('petal length [standardized]') 
>>> plt.ylabel('petal width [standardized]') 
>>> plt.legend(loc='upper left') 
>>> plt.show() 

Since we chose a relatively small value for $\gamma$, the resulting decision boundary of the RBF kernel 
SVM model will be relatively soft, as shown in the following figure: 

Now let's increase the value of $\gamma$ and observe the effect on the decision boundary: 


>>> svm = SVC(kernel='rbf', random_state=0, gamma=100.0, C=1.0) 
>>> svm.fit(X_train_std, y_train) 
>>> plot_decision_regions(X_combined_std, 
...                       y_combined, classifier=svm, 
...                       test_idx=range(105, 150)) 
>>> plt.xlabel('petal length [standardized]') 
>>> plt.ylabel('petal width [standardized]') 
>>> plt.legend(loc='upper left') 
>>> plt.show() 

In the resulting plot, we can now see that the decision boundary around the classes 0 and 1 is much 
tighter using a relatively large value of $\gamma$: 

[Figure: RBF kernel SVM decision regions for gamma=100.0; x axis: petal length [standardized]] 


Although the model fits the training dataset very well, such a classifier will likely have a high 
generalization error on unseen data, which illustrates that the optimization of $\gamma$ also plays an 
important role in controlling overfitting. 
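
The link between $\gamma$ and the tightness of the boundary can be sketched without training anything. The snippet compares the two gamma values used above; the helper `reach`, which returns the distance at which the similarity drops to a chosen level (0.5 here), is a hypothetical construct for illustration.

```python
import numpy as np

def rbf_similarity(distance, gamma):
    """Similarity of two samples that lie `distance` apart."""
    return np.exp(-gamma * distance ** 2)

def reach(gamma, level=0.5):
    """Distance at which the similarity has dropped to `level`."""
    return np.sqrt(np.log(1.0 / level) / gamma)

sim_small_gamma = rbf_similarity(1.0, gamma=0.2)
sim_large_gamma = rbf_similarity(1.0, gamma=100.0)
reach_small_gamma = reach(0.2)
reach_large_gamma = reach(100.0)

print(sim_small_gamma, sim_large_gamma)
print(reach_small_gamma, reach_large_gamma)
```

With gamma=0.2, a training sample still resembles points more than a unit away, so neighboring samples blend into a smooth boundary. With gamma=100.0, the similarity vanishes within a fraction of a unit, so the boundary wraps tightly around individual training samples.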
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Decision tree learning 


Decision tree classifiers are attractive models if we care about interpretability. As the name 
decision tree suggests, we can think of this model as breaking down our data by making decisions 
based on asking a series of questions. 

Let's consider the following example where we use a decision tree to decide upon an activity on a 
particular day: 

[Figure: example decision tree that starts with the question "Work to do?" (Yes / No) and continues 
with branches for the weather outlook (such as overcast and rain) and the question "Friends busy?", 
ending in activities such as "Stay in" and "Go to beach"] 

Based on the features in our training set, the decision tree model learns a series of questions to infer 
the class labels of the samples. Although the preceding figure illustrated the concept of a decision tree 
based on categorical variables, the same concept applies if our 
features are real numbers like in the Iris dataset. For example, we could simply define a cut-off value 
along the sepal width feature axis and ask a binary question "sepal width ≥ 2.8 cm?" 

Using the decision algorithm, we start at the tree root and split the data on the feature that results in 
the largest information gain (IG), which will be explained in more detail in the following section. In 
an iterative process, we can then repeat this splitting procedure at each child node until the leaves are 
pure. This means that the samples at each node all belong to the same class. In practice, this can result 
in a very deep tree with many nodes, which can easily lead to overfitting. Thus, we typically want to 
prune the tree by setting a limit for the maximal depth of the tree. 
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Maximizing information gain — getting the most bang for 
the buck 


In order to split the nodes at the most informative features, we need to define an objective function 
that we want to optimize via the tree learning algorithm. Here, our objective function is to maximize 
the information gain at each split, which we define as follows: 

$$IG(D_p, f) = I(D_p) - \sum_{j=1}^{m} \frac{N_j}{N_p} I(D_j)$$

Here, f is the feature to perform the split, $D_p$ and $D_j$ are the dataset of the parent and jth child node, 
I is our impurity measure, $N_p$ is the total number of samples at the parent node, and $N_j$ is the number 
of samples in the jth child node. As we can see, the information gain is simply the difference between 
the impurity of the parent node and the sum of the child node impurities: the lower the impurity of the 
child nodes, the larger the information gain. However, for simplicity and to reduce the combinatorial 
search space, most libraries (including scikit-learn) implement binary decision trees. This means that 
each parent node is split into two child nodes, $D_{left}$ and $D_{right}$: 

$$IG(D_p, f) = I(D_p) - \frac{N_{left}}{N_p} I(D_{left}) - \frac{N_{right}}{N_p} I(D_{right})$$

Now, the three impurity measures or splitting criteria that are commonly used in binary decision trees 
are Gini index ($I_G$), entropy ($I_H$), and the classification error ($I_E$). Let's start with the definition 
of entropy for all non-empty classes ($p(i \mid t) \ne 0$): 

$$I_H(t) = -\sum_{i=1}^{c} p(i \mid t) \log_2 p(i \mid t)$$

Here, $p(i \mid t)$ is the proportion of the samples that belongs to class i for a particular node t. The 
entropy is therefore 0 if all samples at a node belong to the same class, and the entropy is maximal if 
we have a uniform class distribution. For example, in a binary class setting, the entropy is 0 if 
$p(i=1 \mid t) = 1$ or $p(i=0 \mid t) = 0$. If the classes are distributed uniformly with $p(i=1 \mid t) = 0.5$ and 
$p(i=0 \mid t) = 0.5$, the entropy is 1. Therefore, we can say that the entropy criterion attempts to 
maximize the mutual information in the tree. 


Intuitively, the Gini index can be understood as a criterion to minimize the probability of 
misclassification: 

$$I_G(t) = \sum_{i=1}^{c} p(i \mid t) \left( 1 - p(i \mid t) \right) = 1 - \sum_{i=1}^{c} p(i \mid t)^2$$

Similar to entropy, the Gini index is maximal if the classes are perfectly mixed, for example, in a 
binary class setting (c = 2): 

$$1 - \sum_{i=1}^{c} 0.5^2 = 0.5$$


However, in practice both the Gini index and entropy typically yield very similar results and it is 
often not worth spending much time on evaluating trees using different impurity criteria rather than 
experimenting with different pruning cut-offs. 

Another impurity measure is the classification error: 

$$I_E = 1 - \max\{ p(i \mid t) \}$$


This is a useful criterion for pruning but not recommended for growing a decision tree, since it 1s less 
sensitive to changes 1n the class probabilities of the nodes. We can illustrate this by looking at the two 
possible splitting scenarios shown in the following figure: 
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. D D . 
We start witha dataset ” atthe parentnode ~” that consists of 40 samples fromclass 1 and 40 
7 D.., pi. . ; 
samples from class 2 that we split into two datasets ~“” and "#"" , respectively. The information 


gain using the classification error as a splitting criterion would be the same ( IG, = 0.25 ) in both 


scenario A and B: 


I,(D, )=1-0.5=0.5 


A:T, (Djq)=1-2=0.25 


A: 1, (Duchy) =1 -= =0.25 


A: IG, =9.5 -=0.25 -=0.25 = ().25 


| 
— 
| 


B: ly (Dien ) 
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B: I, (Dig) =1-1=0 


BIG, =0.5-2x--0=0.25 


, . _ B(IG, =0.16) | 
However, the Gini index would favor the split in scenario 7 ’ over scenario 


A UG, =e. 125) , which 1s indeed more pure: 


LG (D, = 1—(0.5° +0.5° ) = (0.5 


ere fo) 2 x 
fly (Dae)=1-](4) (=) |--0ar 


AI, = 05-—0,375 -£0.375 = 0.125 
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B:1G, = 0.5-20.4-0 = 0.16 


B(IG,, =0.19) A(IG,, =0.31) 


Similarly, the entropy criterion would favor scenario over scenario 


I, (D,) =—(0.5 log, (0.5)+0.5 log, (0.5)) =1 


A:ilG, = |-=0.81-0.81=0.19 
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aca tm %.12 2) 4 4 
B:1,,| D4 )}=—| —log,+| — |+—log,+} — | |=0.92 
1 (Dia) eee CO ieee 09: 


B:I.(D...)=0 


A right 
BIG, =1 — —-Q=0.51 
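
The information gains of this worked example can be reproduced numerically. The sketch below assumes class counts that are consistent with the impurity values in the text: scenario A splits the parent (40, 40) into (30, 10) and (10, 30), and scenario B splits it into (20, 40) and (20, 0). The function names are chosen for illustration only:

```python
import numpy as np

def proportions(counts):
    """Turn class counts at a node into class proportions p(i|t)."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

def error(counts):
    return 1.0 - np.max(proportions(counts))

def gini(counts):
    return 1.0 - np.sum(proportions(counts) ** 2)

def entropy(counts):
    p = proportions(counts)
    p = p[p > 0]                      # only non-empty classes
    return -np.sum(p * np.log2(p))

def information_gain(impurity, parent, left, right):
    """Parent impurity minus the sample-weighted child impurities."""
    n = float(np.sum(parent))
    return (impurity(parent)
            - np.sum(left) / n * impurity(left)
            - np.sum(right) / n * impurity(right))

parent = [40, 40]
# Assumed class counts of the child nodes (see the note above)
scenarios = {'A': ([30, 10], [10, 30]),
             'B': ([20, 40], [20, 0])}

gains = {}
for name, (left, right) in scenarios.items():
    for impurity in (error, gini, entropy):
        ig = information_gain(impurity, parent, left, right)
        gains[(name, impurity.__name__)] = ig
        print('%s  %-8s IG = %.3f' % (name, impurity.__name__, ig))
```

Only the classification error assigns the same gain to both scenarios; the Gini index and entropy both rank scenario B higher.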


For a more visual comparison of the three different impurity criteria that we discussed previously,
let's plot the impurity indices for the probability range [0, 1] for class 1. Note that we will also add in
a scaled version of the entropy (entropy/2) to observe that the Gini index is an intermediate measure
between entropy and the classification error. The code is as follows:


>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> def gini(p):
...     return (p)*(1 - (p)) + (1 - p)*(1 - (1-p))
>>> def entropy(p):
...     return - p*np.log2(p) - (1 - p)*np.log2((1 - p))
>>> def error(p):
...     return 1 - np.max([p, 1 - p])
>>> x = np.arange(0.0, 1.0, 0.01)
>>> ent = [entropy(p) if p != 0 else None for p in x]
>>> sc_ent = [e*0.5 if e else None for e in ent]
>>> err = [error(i) for i in x]
>>> fig = plt.figure()
>>> ax = plt.subplot(111)
>>> for i, lab, ls, c, in zip([ent, sc_ent, gini(x), err],
...                           ['Entropy', 'Entropy (scaled)',
...                            'Gini Impurity',
...                            'Misclassification Error'],
...                           ['-', '-', '--', '-.'],
...                           ['black', 'lightgray',
...                            'red', 'green', 'cyan']):
...     line = ax.plot(x, i, label=lab,
...                    linestyle=ls, lw=2, color=c)
>>> ax.legend(loc='upper center', bbox_to_anchor=(0.5, 1.15),
...           ncol=3, fancybox=True, shadow=False)
>>> ax.axhline(y=0.5, linewidth=1, color='k', linestyle='--')
>>> ax.axhline(y=1.0, linewidth=1, color='k', linestyle='--')
>>> plt.ylim([0, 1.1])
>>> plt.xlabel('p(i=1)')
>>> plt.ylabel('Impurity Index')
>>> plt.show()


The plot produced by the preceding code example is as follows: 


[Figure: impurity index (y-axis) plotted against p(i=1) (x-axis) for entropy, scaled entropy, Gini impurity, and misclassification error]


Building a decision tree


Decision trees can build complex decision boundaries by dividing the feature space into rectangles.
However, we have to be careful since the deeper the decision tree, the more complex the decision
boundary becomes, which can easily result in overfitting. Using scikit-learn, we will now train a
decision tree with a maximum depth of 3 using entropy as a criterion for impurity. Although feature
scaling may be desired for visualization purposes, note that feature scaling is not a requirement for
decision tree algorithms. The code is as follows:


>>> from sklearn.tree import DecisionTreeClassifier
>>> tree = DecisionTreeClassifier(criterion='entropy',
...                               max_depth=3, random_state=0)
>>> tree.fit(X_train, y_train)
>>> X_combined = np.vstack((X_train, X_test))
>>> y_combined = np.hstack((y_train, y_test))
>>> plot_decision_regions(X_combined, y_combined,
...                       classifier=tree, test_idx=range(105,150))
>>> plt.xlabel('petal length [cm]')
>>> plt.ylabel('petal width [cm]')
>>> plt.legend(loc='upper left')
>>> plt.show()


After executing the preceding code example, we get the typical axis-parallel decision boundaries of 
the decision tree: 


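[Figure: decision regions of the decision tree with axis-parallel boundaries; x-axis: petal length [cm], y-axis: petal width [cm]; test set samples highlighted]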





A nice feature in scikit-learn is that it allows us to export the decision tree as a .dot file after
training, which we can visualize using the GraphViz program. This program is freely available at
http://www.graphviz.org and supported by Linux, Windows, and Mac OS X.
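
If installing GraphViz is not an option, a plain-text rendering of the fitted tree can serve as a quick substitute. The following is a minimal sketch, assuming a scikit-learn release that provides sklearn.tree.export_text and sklearn.model_selection.train_test_split; the train/test split used here is an assumption and may not reproduce exactly the same tree as in this chapter:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X = iris.data[:, [2, 3]]          # petal length and petal width
y = iris.target

# Assumed split: 70 percent training data, 30 percent test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(criterion='entropy',
                              max_depth=3, random_state=0)
tree.fit(X_train, y_train)

# One line per node; indentation reflects the depth in the tree
report = export_text(tree, feature_names=['petal length', 'petal width'])
print(report)
```

The printed rules show the same thresholds that GraphViz would draw, which is often enough to trace how a sample travels from the root to a leaf.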


First, we create the .dot file via scikit-learn using the export_graphviz function from the tree
submodule, as follows:


>>> from sklearn.tree import export_graphviz
>>> export_graphviz(tree,
...                 out_file='tree.dot',
...                 feature_names=['petal length', 'petal width'])


After we have installed GraphViz on our computer, we can convert the tree.dot file into a PNG file
by executing the following command from the command line in the location where we saved the
tree.dot file:


> dot -Tpng tree.dot -o tree.png 


[Figure: the decision tree rendered by GraphViz. Root node: petal width <= 0.75, entropy = 1.580, samples = 105. Left child: entropy = 0.0, samples = 34, value = [34, 0, 0]. Right child: petal length <= 4.95, entropy = 0.993, samples = 71, followed by further splits that end in leaf nodes.]





Looking at the decision tree figure that we created via GraphViz, we can now nicely trace back the
splits that the decision tree determined from our training dataset. We started with 105 samples at the
root and split them into two child nodes with 34 and 71 samples, using the petal width cut-off ≤ 0.75
cm. After the first split, we can see that the left child node is already pure and only contains samples
from the Iris-Setosa class (entropy = 0). The further splits on the right are then used to separate the
samples from the Iris-Versicolor and Iris-Virginica classes.


Combining weak to strong learners via random forests


Random forests have gained huge popularity in applications of machine learning during the last
decade due to their good classification performance, scalability, and ease of use. Intuitively, a random
forest can be considered as an ensemble of decision trees. The idea behind ensemble learning is to
combine weak learners to build a more robust model, a strong learner, that has a better
generalization error and is less susceptible to overfitting. The random forest algorithm can be
summarized in four simple steps:


1. Draw a random bootstrap sample of size n (randomly choose n samples from the training set
   with replacement).
2. Grow a decision tree from the bootstrap sample. At each node:
    1. Randomly select d features without replacement.
    2. Split the node using the feature that provides the best split according to the objective
       function, for instance, by maximizing the information gain.
3. Repeat the steps 1 to 2 k times.
4. Aggregate the prediction by each tree to assign the class label by majority vote. Majority voting
   will be discussed in more detail in Chapter 7, Combining Different Models for Ensemble
   Learning.


There is a slight modification in step 2 when we are training the individual decision trees: instead of
evaluating all features to determine the best split at each node, we only consider a random subset of
those.
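
The four steps can be sketched by hand on top of scikit-learn's DecisionTreeClassifier. This is an illustrative sketch under stated assumptions, not the way RandomForestClassifier is implemented internally: the function names fit_forest and predict_forest are hypothetical, the full four-feature Iris dataset is used for brevity, and the per-node feature subset is delegated to the tree's max_features option:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, k=10, seed=1):
    """Grow k trees, each on its own bootstrap sample (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    trees = []
    for _ in range(k):                              # step 3: repeat k times
        idx = rng.randint(0, n, size=n)             # step 1: bootstrap sample
        tree = DecisionTreeClassifier(
            criterion='entropy',
            max_features='sqrt',                    # step 2.1: random feature subset
            random_state=rng.randint(0, 10 ** 6))
        trees.append(tree.fit(X[idx], y[idx]))      # step 2: grow the tree
    return trees

def predict_forest(trees, X):
    """Step 4: assign the label that most trees vote for."""
    votes = np.array([tree.predict(X) for tree in trees])  # shape (k, n_samples)
    return np.array([np.bincount(column).argmax() for column in votes.T])

iris = load_iris()
X, y = iris.data, iris.target
trees = fit_forest(X, y, k=10)
y_pred = predict_forest(trees, X)
print('training accuracy: %.3f' % np.mean(y_pred == y))
```

The vote in predict_forest assumes non-negative integer class labels, which holds for the Iris targets used here.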


Although random forests don't offer the same level of interpretability as decision trees, a big
advantage of random forests is that we don't have to worry so much about choosing good
hyperparameter values. We typically don't need to prune the random forest since the ensemble model
is quite robust to noise from the individual decision trees. The only parameter that we really need to
care about in practice is the number of trees k (step 3) that we choose for the random forest.
Typically, the larger the number of trees, the better the performance of the random forest classifier at
the expense of an increased computational cost.


Although it is less common in practice, other hyperparameters of the random forest classifier that can
be optimized (using techniques we will discuss in Chapter 6, Learning Best Practices for Model
Evaluation and Hyperparameter Tuning) are the size n of the bootstrap sample (step 1) and the number of features d that is
randomly chosen for each split (step 2.1), respectively. Via the sample size n of the bootstrap sample,
we control the bias-variance tradeoff of the random forest. By choosing a larger value for n, we
decrease the randomness and thus the forest is more likely to overfit. On the other hand, we can
reduce the degree of overfitting by choosing smaller values for n at the expense of the model
performance. In most implementations, including the RandomForestClassifier implementation in
scikit-learn, the sample size of the bootstrap sample is chosen to be equal to the number of samples in
the original training set, which usually provides a good bias-variance tradeoff. For the number of
features d at each split, we want to choose a value that is smaller than the total number of features in
the training set. A reasonable default that is used in scikit-learn and other implementations is $d = \sqrt{m}$,
where m is the number of features in the training set.


Conveniently, we don't have to construct the random forest classifier from individual decision trees
by ourselves; there is already an implementation in scikit-learn that we can use:


>>> from sklearn.ensemble import RandomForestClassifier
>>> forest = RandomForestClassifier(criterion='entropy',
...                                 n_estimators=10,
...                                 random_state=1,
...                                 n_jobs=2)
>>> forest.fit(X_train, y_train)
>>> plot_decision_regions(X_combined, y_combined,
...                       classifier=forest, test_idx=range(105,150))
>>> plt.xlabel('petal length')
>>> plt.ylabel('petal width')
>>> plt.legend(loc='upper left')
>>> plt.show()


After executing the preceding code, we should see the decision regions formed by the ensemble of 
trees in the random forest, as shown in the following figure: 


[Figure: decision regions of the random forest; legend: classes 0, 1, 2, and test set; x-axis: petal length, y-axis: petal width]





Using the preceding code, we trained a random forest from 10 decision trees via the n_estimators
parameter and used the entropy criterion as an impurity measure to split the nodes. Although we are
growing a very small random forest from a very small training dataset, we used the n_jobs parameter
for demonstration purposes, which allows us to parallelize the model training using multiple cores of
our computer (here, two).


K-nearest neighbors – a lazy learning algorithm


The last supervised learning algorithm that we want to discuss in this chapter is the k-nearest
neighbor classifier (KNN), which is particularly interesting because it is fundamentally different
from the learning algorithms that we have discussed so far.


KNN is a typical example of a lazy learner. It is called lazy not because of its apparent simplicity,
but because it doesn't learn a discriminative function from the training data but memorizes the training
dataset instead.


Note 
Parametric versus nonparametric models 


Machine learning algorithms can be grouped into parametric and nonparametric models. Using 
parametric models, we estimate parameters from the training dataset to learn a function that can 
classify new data points without requiring the original training dataset anymore. Typical examples of 
parametric models are the perceptron, logistic regression, and the linear SVM. In contrast, 
nonparametric models can't be characterized by a fixed set of parameters, and the number of 
parameters grows with the training data. Two examples of nonparametric models that we have seen 
so far are the decision tree classifier/random forest and the kernel SVM. 


KNN belongs to a subcategory of nonparametric models that is described as instance-based
learning. Models based on instance-based learning are characterized by memorizing the training
dataset, and lazy learning is a special case of instance-based learning that is associated with no (zero)
cost during the learning process.


The KNN algorithm itself is fairly straightforward and can be summarized by the following steps: 


1. Choose the number of k and a distance metric.
2. Find the k nearest neighbors of the sample that we want to classify.
3. Assign the class label by majority vote.


The following figure illustrates how a new data point (?) is assigned the triangle class label based on
majority voting among its five nearest neighbors.


Based on the chosen distance metric, the KNN algorithm finds the k samples in the training dataset
that are closest (most similar) to the point that we want to classify. The class label of the new data
point is then determined by a majority vote among its k nearest neighbors.
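
The three steps translate almost directly into NumPy. The following is a minimal sketch that assumes a Euclidean distance and resolves ties simply by the order in which labels are encountered; the function name knn_predict and the toy data are made up for illustration:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training samples."""
    # Step 1 is the choice of k and of the metric (Euclidean distance here)
    dists = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    # Step 2: indices of the k closest training samples
    nearest = np.argsort(dists)[:k]
    # Step 3: majority vote among their class labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data: two well-separated groups of three points each
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.2, 0.2]), k=3))
print(knn_predict(X_train, y_train, np.array([5.5, 5.5]), k=3))
```

Note that there is no training step at all: the "model" is the stored training data, and all the work happens at prediction time.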


The main advantage of such a memory-based approach is that the classifier immediately adapts as we
collect new training data. However, the downside is that the computational complexity for classifying
new samples grows linearly with the number of samples in the training dataset in the worst-case
scenario, unless the dataset has very few dimensions (features) and the algorithm has been
implemented using efficient data structures such as KD-trees (J. H. Friedman, J. L. Bentley, and R. A.
Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on
Mathematical Software (TOMS), 3(3):209–226, 1977). Furthermore, we can't discard training
samples since no training step is involved. Thus, storage space can become a challenge if we are
working with large datasets.


By executing the following code, we will now implement a KNN model in scikit-learn using a
Euclidean distance metric:


>>> from sklearn.neighbors import KNeighborsClassifier
>>> knn = KNeighborsClassifier(n_neighbors=5, p=2,
...                            metric='minkowski')
>>> knn.fit(X_train_std, y_train)
>>> plot_decision_regions(X_combined_std, y_combined,
...                       classifier=knn, test_idx=range(105,150))
>>> plt.xlabel('petal length [standardized]')
>>> plt.ylabel('petal width [standardized]')
>>> plt.show()


By specifying five neighbors in the KNN model for this dataset, we obtain a relatively smooth 
decision boundary, as shown in the following figure:


Note


In the case of a tie, the scikit-learn implementation of the KNN algorithm will prefer the neighbors 
with a closer distance to the sample. If the neighbors have a similar distance, the algorithm will 
choose the class label that comes first in the training dataset. 


The right choice of k is crucial to find a good balance between over- and underfitting. We also have
to make sure that we choose a distance metric that is appropriate for the features in the dataset. Often,
a simple Euclidean distance measure is used for real-valued samples, for example, the flowers in our
Iris dataset, which have features measured in centimeters. However, if we are using a Euclidean
distance measure, it is also important to standardize the data so that each feature contributes equally
to the distance. The 'minkowski' distance that we used in the previous code is just a generalization
of the Euclidean and Manhattan distance that can be written as follows:


$$d\left(x^{(i)}, x^{(j)}\right) = \sqrt[p]{\sum_{k} \left| x_k^{(i)} - x_k^{(j)} \right|^p}$$


It becomes the Euclidean distance if we set the parameter p=2 or the Manhattan distance at p=1,
respectively. Many other distance metrics are available in scikit-learn and can be provided to the
metric parameter. They are listed at
http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html.
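
To make the role of p concrete, here is a small sketch of the Minkowski distance in plain NumPy. The function name minkowski is chosen for illustration and is not the scikit-learn implementation:

```python
import numpy as np

def minkowski(x_i, x_j, p=2):
    """Minkowski distance between two feature vectors for a given p >= 1."""
    x_i = np.asarray(x_i, dtype=float)
    x_j = np.asarray(x_j, dtype=float)
    return float(np.sum(np.abs(x_i - x_j) ** p) ** (1.0 / p))

a, b = [0.0, 0.0], [3.0, 4.0]
euclidean = minkowski(a, b, p=2)   # straight-line distance
manhattan = minkowski(a, b, p=1)   # sum of absolute coordinate differences
print(euclidean, manhattan)
```

For the two points used here, the Euclidean distance is the hypotenuse of a 3-4-5 triangle, while the Manhattan distance adds the two legs.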





The curse of dimensionality 


It is important to mention that KNN is very susceptible to overfitting due to the curse of
dimensionality. The curse of dimensionality describes the phenomenon where the feature space
becomes increasingly sparse for an increasing number of dimensions of a fixed-size training dataset.
Intuitively, we can think of even the closest neighbors being too far away in a high-dimensional space
to give a good estimate.
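
This sparsity can be observed in a small experiment. The sketch below draws a fixed number of points uniformly from the unit hypercube and reports the average distance of each point to its nearest neighbor; the function name, the sample size, and the chosen dimensions are arbitrary assumptions made for illustration:

```python
import numpy as np

def mean_nn_distance(n_samples, n_dims, seed=0):
    """Average Euclidean distance from each point to its nearest neighbor."""
    rng = np.random.RandomState(seed)
    X = rng.uniform(size=(n_samples, n_dims))
    # Pairwise squared distances via |a|^2 + |b|^2 - 2 a.b
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X.dot(X.T)
    np.fill_diagonal(d2, np.inf)          # ignore the distance of a point to itself
    d2 = np.maximum(d2, 0.0)              # guard against tiny negative round-off
    return float(np.sqrt(d2.min(axis=1)).mean())

for n_dims in (2, 20, 200):
    print(n_dims, round(mean_nn_distance(200, n_dims), 3))
```

With the number of samples held fixed, the nearest neighbor moves further away as dimensions are added, which is exactly why a "neighborhood" loses its meaning in high-dimensional feature spaces.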


We have discussed the concept of regularization in the section about logistic regression as one way to
avoid overfitting. However, in models where regularization is not applicable such as decision trees
and KNN, we can use feature selection and dimensionality reduction techniques to help us avoid the
curse of dimensionality. This will be discussed in more detail in the next chapter.


Summary


In this chapter, you learned about many different machine learning algorithms that are used to tackle linear and
nonlinear problems. We have seen that decision trees are particularly attractive if we care about
interpretability. Logistic regression is not only a useful model for online learning via stochastic
gradient descent, but also allows us to predict the probability of a particular event. Although support
vector machines are powerful linear models that can be extended to nonlinear problems via the kernel
trick, they have many parameters that have to be tuned in order to make good predictions. In contrast,
ensemble methods such as random forests don't require much parameter tuning and don't overfit as
easily as decision trees, which makes them attractive models for many practical problem domains. The
k-nearest neighbor classifier offers an alternative approach to classification via lazy learning that
allows us to make predictions without any model training but with a more computationally expensive
prediction step.


However, even more important than the choice of an appropriate learning algorithm is the available 
data in our training dataset. No algorithm will be able to make good predictions without informative 
and discriminatory features. 


In the next chapter, we will discuss important topics regarding the preprocessing of data, feature 
selection, and dimensionality reduction, which we will need to build powerful machine learning 
models. Later in Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter 
Tuning, we will see how we can evaluate and compare the performance of our models and learn 
useful tricks to fine-tune the different algorithms.


Chapter 4. Building Good Training Sets – Data Preprocessing


The quality of the data and the amount of useful information that it contains are key factors that
determine how well a machine learning algorithm can learn. Therefore, it is absolutely critical that
we make sure to examine and preprocess a dataset before we feed it to a learning algorithm. In this
chapter, we will discuss the essential data preprocessing techniques that will help us to build good
machine learning models.


The topics that we will cover in this chapter are as follows: 


• Removing and imputing missing values from the dataset
• Getting categorical data into shape for machine learning algorithms
• Selecting relevant features for the model construction


Dealing with missing data


It is not uncommon in real-world applications that our samples are missing one or more values for
various reasons. For example, there could have been an error in the data collection process, certain
measurements may not be applicable, or particular fields could simply have been left blank in a survey. We
typically see missing values as the blank spaces in our data table or as placeholder strings such as
NaN (Not A Number).


Unfortunately, most computational tools are unable to handle such missing values or would produce
unpredictable results if we simply ignored them. Therefore, it is crucial that we take care of those
missing values before we proceed with further analyses. But before we discuss several techniques for
dealing with missing values, let's create a simple example data frame from a CSV (comma-separated
values) file to get a better grasp of the problem:


>>> import pandas as pd
>>> from io import StringIO
>>> csv_data = '''A,B,C,D
... 1.0,2.0,3.0,4.0
... 5.0,6.0,,8.0
... 0.0,11.0,12.0,'''
>>> # If you are using Python 2.7, you need
>>> # to convert the string to unicode:
>>> # csv_data = unicode(csv_data)
>>> df = pd.read_csv(StringIO(csv_data))
>>> df
   A   B    C    D
0  1   2    3    4
1  5   6  NaN    8
2  0  11   12  NaN

Using the preceding code, we read CSV-formatted data into a pandas DataFrame via the read_csv
function and noticed that the two missing cells were replaced by NaN. The StringIO function in the
preceding code example was simply used for the purposes of illustration. It allows us to read the
string assigned to csv_data into a pandas DataFrame as if it was a regular CSV file on our hard
drive.


For a larger DataFrame, it can be tedious to look for missing values manually; in this case, we can
use the isnull method to return a DataFrame with Boolean values that indicate whether a cell
contains a numeric value (False) or if data is missing (True). Using the sum method, we can then
return the number of missing values per column as follows:


>>> df.isnull().sum()
A    0
B    0
C    1
D    1
dtype: int64


This way, we can count the number of missing values per column; in the following subsections, we
will take a look at different strategies for how to deal with this missing data.


Note


Although scikit-learn was developed for working with NumPy arrays, it can sometimes be more
convenient to preprocess data using pandas' DataFrame. We can always access the underlying NumPy
array of the DataFrame via the values attribute before we feed it into a scikit-learn estimator:


>>> df.values
array([[  1.,   2.,   3.,   4.],
       [  5.,   6.,  nan,   8.],
       [  0.,  11.,  12.,  nan]])


Eliminating samples or features with missing values


One of the easiest ways to deal with missing data is to simply remove the corresponding features
(columns) or samples (rows) from the dataset entirely; rows with missing values can be easily
dropped via the dropna method:


>>> df.dropna()
   A  B  C  D
0  1  2  3  4


Similarly, we can drop columns that have at least one NaN in any row by setting the axis argument to
1:


>>> df.dropna(axis=1)
   A   B
0  1   2
1  5   6
2  0  11


The dropna method supports several additional parameters that can come in handy: 


# only drop rows where all columns are NaN
>>> df.dropna(how='all')


# drop rows that have fewer than 4 non-NaN values
>>> df.dropna(thresh=4)


# only drop rows where NaN appear in specific columns (here: 'C')
>>> df.dropna(subset=['C'])


Although the removal of missing data seems to be a convenient approach, it also comes with certain
disadvantages; for example, we may end up removing too many samples, which will make a reliable
analysis impossible. Or, if we remove too many feature columns, we will run the risk of losing
valuable information that our classifier needs to discriminate between classes. In the next section, we
will thus look at one of the most commonly used alternatives for dealing with missing values:
interpolation techniques.
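
The dropna variants discussed in this section can be compared side by side on the example data. The following sketch assumes Python 3 and rebuilds the same small data frame; the variable names on the left are chosen for illustration:

```python
from io import StringIO
import pandas as pd

csv_data = 'A,B,C,D\n1.0,2.0,3.0,4.0\n5.0,6.0,,8.0\n0.0,11.0,12.0,'
df = pd.read_csv(StringIO(csv_data))

rows_complete = df.dropna()               # drop rows with any NaN
cols_complete = df.dropna(axis=1)         # drop columns with any NaN
not_all_nan = df.dropna(how='all')        # drop rows that are entirely NaN
at_least_4 = df.dropna(thresh=4)          # keep rows with at least 4 non-NaN values
c_present = df.dropna(subset=['C'])       # drop rows where column C is NaN

for name, frame in [('rows_complete', rows_complete),
                    ('cols_complete', cols_complete),
                    ('not_all_nan', not_all_nan),
                    ('at_least_4', at_least_4),
                    ('c_present', c_present)]:
    print(name, frame.shape)
```

Comparing the resulting shapes makes the cost of each strategy visible: on this tiny dataset, dropping incomplete rows discards two of three samples, and dropping incomplete columns discards half of the features.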
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Imputing missing values 


Often, the removal of samples or dropping of entire feature columns is simply not feasible, because
we might lose too much valuable data. In this case, we can use different interpolation techniques to
estimate the missing values from the other training samples in our dataset. One of the most common
interpolation techniques is mean imputation, where we simply replace the missing value by the mean
value of the entire feature column. A convenient way to achieve this is by using the Imputer class
from scikit-learn, as shown in the following code:


>>> from sklearn.preprocessing import Imputer
>>> imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
>>> imr = imr.fit(df)
>>> imputed_data = imr.transform(df.values)
>>> imputed_data
array([[  1. ,   2. ,   3. ,   4. ],
       [  5. ,   6. ,   7.5,   8. ],
       [  0. ,  11. ,  12. ,   6. ]])
Here, we replaced each NaN value by the corresponding mean, which is separately calculated for
each feature column. If we changed the setting axis=0 to axis=1, we'd calculate the row means.
Other options for the strategy parameter are median or most_frequent, where the latter replaces
the missing values by the most frequent values. This is useful for imputing categorical feature values.


Understanding the scikit-learn estimator API


In the previous section, we used the Imputer class from scikit-learn to impute missing values in our
dataset. The Imputer class belongs to the so-called transformer classes in scikit-learn that are used
for data transformation. The two essential methods of those estimators are fit and transform. The
fit method is used to learn the parameters from the training data, and the transform method uses
those parameters to transform the data. Any data array that is to be transformed needs to have the
same number of features as the data array that was used to fit the model. The following figure
illustrates how a transformer fitted on the training data is used to transform a training dataset as well
as a new test dataset:


[Figure: a transformer is fitted on the training data via est.fit(X_train); est.transform(X_train) and est.transform(X_test) then produce the transformed training data and the transformed test data.]
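
The fit/transform pattern can be tried out with a mean imputer. The sketch below assumes a more recent scikit-learn release in which sklearn.impute.SimpleImputer takes the place of the Imputer class; SimpleImputer works column-wise only, so there is no axis argument. The one-row test array is made up for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 2.0, 3.0, 4.0],
                    [5.0, 6.0, np.nan, 8.0],
                    [0.0, 11.0, 12.0, np.nan]])
# Hypothetical new data with the same number of features
X_test = np.array([[np.nan, 1.0, np.nan, 2.0]])

imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr.fit(X_train)                       # learn the column means from the training data

X_train_imputed = imr.transform(X_train)
X_test_imputed = imr.transform(X_test)  # reuse the training means on the test data

print(imr.statistics_)
print(X_test_imputed)
```

The important detail is that the test data is filled in with the means learned from the training data, not with statistics computed on the test data itself.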





The classifiers that we used in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-Learn,
belong to the so-called estimators in scikit-learn with an API that is conceptually very similar
to the transformer class. Estimators have a predict method but can also have a transform method,
as we will see later. As you may recall, we also used the fit method to learn the parameters of a
model when we trained those estimators for classification. However, in supervised learning tasks, we
additionally provide the class labels for fitting the model, which can then be used to make predictions
about new data samples via the predict method, as illustrated in the following figure:


[Figure: an estimator is fitted on the training data and training labels via est.fit(X_train, y_train); est.predict(X_test) then returns the predicted labels for the test data.]


Handling categorical data


So far, we have only been working with numerical values. However, it is not uncommon that real-world
datasets contain one or more categorical feature columns. When we are talking about
categorical data, we have to further distinguish between nominal and ordinal features. Ordinal
features can be understood as categorical values that can be sorted or ordered. For example, T-shirt
size would be an ordinal feature, because we can define an order XL > L > M. In contrast, nominal
features don't imply any order and, to continue with the previous example, we could think of T-shirt
color as a nominal feature since it typically doesn't make sense to say that, for example, red is larger
than blue.


Before we explore different techniques to handle such categorical data, let's create a new data frame
to illustrate the problem:


>>> import pandas as pd
>>> df = pd.DataFrame([
...            ['green', 'M', 10.1, 'class1'],
...            ['red', 'L', 13.5, 'class2'],
...            ['blue', 'XL', 15.3, 'class1']])
>>> df.columns = ['color', 'size', 'price', 'classlabel']
>>> df
   color size  price classlabel
0  green    M   10.1     class1
1    red    L   13.5     class2
2   blue   XL   15.3     class1

As we can see in the preceding output, the newly created DataFrame contains a nominal feature
(color), an ordinal feature (size), and a numerical feature (price) column. The class labels
(assuming that we created a dataset for a supervised learning task) are stored in the last column. The
learning algorithms for classification that we discuss in this book do not use ordinal information in
class labels.


Mapping ordinal features


To make sure that the learning algorithm interprets the ordinal features correctly, we need to convert
the categorical string values into integers. Unfortunately, there is no convenient function that can
automatically derive the correct order of the labels of our size feature. Thus, we have to define the
mapping manually. In the following simple example, let's assume that we know the difference
between features, for example, $XL = L + 1 = M + 2$.


>>> size_mapping = {
...                 'XL': 3,
...                 'L': 2,
...                 'M': 1}
>>> df['size'] = df['size'].map(size_mapping)
>>> df
   color  size  price classlabel
0  green     1   10.1     class1
1    red     2   13.5     class2
2   blue     3   15.3     class1

If we want to transform the integer values back to the original string representation at a later stage, we
can simply define a reverse-mapping dictionary inv_size_mapping = {v: k for k, v in
size_mapping.items()} that can then be used via the pandas' map method on the transformed
feature column similar to the size_mapping dictionary that we used previously.
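
As a short illustration of this round trip, the following sketch maps a size column to integers and back again. It uses a reduced, made-up data frame with only the size column:

```python
import pandas as pd

df = pd.DataFrame({'size': ['M', 'L', 'XL']})

size_mapping = {'XL': 3, 'L': 2, 'M': 1}
df['size'] = df['size'].map(size_mapping)         # strings -> ordered integers

# Swap keys and values to obtain the reverse mapping
inv_size_mapping = {v: k for k, v in size_mapping.items()}
restored = df['size'].map(inv_size_mapping)       # integers -> original strings

print(df['size'].tolist())
print(restored.tolist())
```

The reverse mapping is only well defined because every size is assigned a distinct integer.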




Encoding class labels 


Many machine learning libraries require that class labels are encoded as integer values. Although
most estimators for classification in scikit-learn convert class labels to integers internally, it is
considered good practice to provide class labels as integer arrays to avoid technical glitches. To
encode the class labels, we can use an approach similar to the mapping of ordinal features discussed
previously. We need to remember that class labels are not ordinal, and it doesn't matter which integer
number we assign to a particular string-label. Thus, we can simply enumerate the class labels starting
at 0:


>>> import numpy as np
>>> class_mapping = {label:idx for idx,label in
...                  enumerate(np.unique(df['classlabel']))}
>>> class_mapping
{'class1': 0, 'class2': 1}


Next we can use the mapping dictionary to transform the class labels into integers:


>>> df['classlabel'] = df['classlabel'].map(class_mapping)
>>> df
   color  size  price  classlabel
0  green     1   10.1           0
1    red     2   13.5           1
2   blue     3   15.3           0

We can reverse the key-value pairs in the mapping dictionary as follows to map the converted class 
labels back to the original string representation: 


>>> inv_class_mapping = {v: k for k, v in class_mapping.items()}
>>> df['classlabel'] = df['classlabel'].map(inv_class_mapping)
>>> df
   color  size  price classlabel
0  green     1   10.1     class1
1    red     2   13.5     class2
2   blue     3   15.3     class1


Alternatively, there is a convenient LabelEncoder class directly implemented in scikit-learn to
achieve the same:


>>> from sklearn.preprocessing import LabelEncoder
>>> class_le = LabelEncoder()
>>> y = class_le.fit_transform(df['classlabel'].values)
>>> y
array([0, 1, 0])


Note that the fit_transform method is just a shortcut for calling fit and transform separately, and
we can use the inverse_transform method to transform the integer class labels back into their
original string representation:


>>> class_le.inverse_transform(y)
array(['class1', 'class2', 'class1'], dtype=object)


Performing one-hot encoding on nominal features


In the previous section, we used a simple dictionary-mapping approach to convert the ordinal size 
feature into integers. Since scikit-learn's estimators treat class labels without any order, we used the 
convenient LabelEncoder class to encode the string labels into integers. It may appear that we could
use a similar approach to transform the nominal color column of our dataset, as follows: 


>>> X = df[['color', 'size', 'price']].values
>>> color_le = LabelEncoder()
>>> X[:, 0] = color_le.fit_transform(X[:, 0])
>>> X
array([[1, 1, 10.1],
       [2, 2, 13.5],
       [0, 3, 15.3]], dtype=object)


After executing the preceding code, the first column of the NumPy array X now holds the new color
values, which are encoded as follows:


• blue → 0
• green → 1
• red → 2


If we stop at this point and feed the array to our classifier, we will make one of the most common
mistakes in dealing with categorical data. Can you spot the problem? Although the color values don't
come in any particular order, a learning algorithm will now assume that green is larger than blue, and
red is larger than green. Although this assumption is incorrect, the algorithm could still produce
useful results. However, those results would not be optimal.
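
The effect can be made concrete with a small NumPy sketch. It compares the integer codes listed above with an alternative in which each color gets its own binary indicator vector; the helper names are illustrative and not part of any library:

```python
import numpy as np

# Integer codes from the label encoding above: blue=0, green=1, red=2.
int_codes = {'blue': 0, 'green': 1, 'red': 2}

# Binary indicator vectors in the order (blue, green, red).
indicator = {'blue': np.array([1, 0, 0]),
             'green': np.array([0, 1, 0]),
             'red': np.array([0, 0, 1])}

def int_distance(a, b):
    """Absolute difference between two integer-coded colors."""
    return abs(int_codes[a] - int_codes[b])

def indicator_distance(a, b):
    """Euclidean distance between two indicator-coded colors."""
    return float(np.linalg.norm(indicator[a] - indicator[b]))

# Integer coding makes red look twice as far from blue as green is.
print(int_distance('blue', 'green'), int_distance('blue', 'red'))

# Indicator coding keeps every pair of distinct colors equally far apart.
print(indicator_distance('blue', 'green'), indicator_distance('blue', 'red'))
```

Any algorithm that works with distances or weighted sums of the feature values inherits the artificial ordering of the integer codes, whereas the indicator vectors do not impose one.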


A common workaround for this problem is to use a technique called one-hot encoding. The idea 
behind this approach is to create a new dummy feature for each unique value in the nominal feature 
column. Here, we would convert the color feature into three new features: blue, green, and red. 
Binary values can then be used to indicate the particular color of a sample; for example, a blue 
sample can be encoded as blue=1, green=0, red=0. To perform this transformation, we can use the 
OneHotEncoder that is implemented in the sklearn.preprocessing module:


>>> from sklearn.preprocessing import OneHotEncoder
>>> ohe = OneHotEncoder(categorical_features=[0])
>>> ohe.fit_transform(X).toarray()
array([[  0. ,   1. ,   0. ,   1. ,  10.1],
       [  0. ,   0. ,   1. ,   2. ,  13.5],
       [  1. ,   0. ,   0. ,   3. ,  15.3]])


When we initialized the OneHotEncoder, we defined the column position of the variable that we
want to transform via the categorical_features parameter (note that color is the first column in
the feature matrix X). By default, the OneHotEncoder returns a sparse matrix when we use the
transform method, and we converted the sparse matrix representation into a regular (dense) NumPy
array for the purposes of visualization via the toarray method. Sparse matrices are simply a more
efficient way of storing large datasets, and one that is supported by many scikit-learn functions, which
is especially useful if it contains a lot of zeros. To omit the toarray step, we could initialize the
encoder as OneHotEncoder(..., sparse=False) to return a regular NumPy array.
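
The categorical_features argument shown above belongs to older scikit-learn releases and was removed in later ones. In recent releases, one way to obtain the same result is to combine OneHotEncoder with ColumnTransformer, as in the following sketch. The array X is rebuilt by hand here, with the original color strings, so that the example is self-contained; treat it as an illustration that assumes a recent scikit-learn version, not as the approach used in the rest of this chapter:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hand-written copy of the example data: color, size, price.
X = np.array([['green', 1, 10.1],
              ['red', 2, 13.5],
              ['blue', 3, 15.3]], dtype=object)

# One-hot encode column 0 and pass the remaining columns through unchanged.
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0])],
                       remainder='passthrough')
X_ohe = ct.fit_transform(X)

# Depending on the density of the result, the output may be sparse.
if hasattr(X_ohe, 'toarray'):
    X_ohe = X_ohe.toarray()
X_ohe = np.asarray(X_ohe, dtype=float)

print(X_ohe)
```

The first three columns are the dummy features in sorted category order (blue, green, red), followed by size and price.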


An even more convenient way to create those dummy features via one-hot encoding is to use the
get_dummies function implemented in pandas. Applied on a DataFrame, the get_dummies function
will only convert string columns and leave all other columns unchanged:


>>> pd.get_dummies(df[['price', 'color', 'size']])
   price  size  color_blue  color_green  color_red
0   10.1     1           0            1          0
1   13.5     2           0            0          1
2   15.3     3           1            0          0


Partitioning a dataset in training and test sets 


We briefly introduced the concept of partitioning a dataset into separate datasets for training and 
testing in Chapter 1, Giving Computers the Ability to Learn from Data, and Chapter 3, A Tour of 
Machine Learning Classifiers Using Scikit-learn. Remember that the test set can be understood as 
the ultimate test of our model before we let it loose on the real world. In this section, we will 
prepare a new dataset, the Wine dataset. After we have preprocessed the dataset, we will explore 
different techniques for feature selection to reduce the dimensionality of a dataset. 


The Wine dataset is another open-source dataset that is available from the UCI machine learning
repository (https://archive.ics.uci.edu/ml/datasets/Wine); it consists of 178 wine samples with 13
features describing their different chemical properties.


Using the pandas library, we will directly read in the open source Wine dataset from the UCI machine 
learning repository: 


>>> df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/'
...                       'machine-learning-databases/wine/wine.data',
...                       header=None)
>>> df_wine.columns = ['Class label', 'Alcohol',
...                    'Malic acid', 'Ash',
...                    'Alcalinity of ash', 'Magnesium',
...                    'Total phenols', 'Flavanoids',
...                    'Nonflavanoid phenols',
...                    'Proanthocyanins',
...                    'Color intensity', 'Hue',
...                    'OD280/OD315 of diluted wines',
...                    'Proline']
>>> print('Class labels', np.unique(df_wine['Class label']))
Class labels [1 2 3]
>>> df_wine.head()

The 13 different features in the Wine dataset, describing the chemical properties of the 178 wine 
samples, are listed in the following table: 

[Table: the first five rows of df_wine as returned by df_wine.head(), with one column for the class label and one for each of the 13 features]


The samples belong to one of three different classes, 1, 2, and 3, which refer to the three different
types of grapes that have been grown in different regions in Italy.


A convenient way to randomly partition this dataset into a separate test and training dataset is to use
the train_test_split function from scikit-learn's cross_validation submodule:


>>> from sklearn.cross_validation import train_test_split
>>> X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
>>> X_train, X_test, y_train, y_test = \
...     train_test_split(X, y, test_size=0.3, random_state=0)


First, we assigned the NumPy array representation of feature columns 1-13 to the variable X, and we
assigned the class labels from the first column to the variable y. Then, we used the
train_test_split function to randomly split X and y into separate training and test datasets. By
setting test_size=0.3 we assigned 30 percent of the wine samples to X_test and y_test, and the
remaining 70 percent of the samples were assigned to X_train and y_train, respectively.
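
To make the mechanics of such a split explicit, here is a minimal NumPy sketch. It uses random placeholder data with the shape of the Wine dataset, and simple_split is a hypothetical helper, not a scikit-learn function; it will not reproduce the exact rows that train_test_split selects. Note also that in current scikit-learn releases train_test_split is imported from sklearn.model_selection instead of sklearn.cross_validation.

```python
import numpy as np

# Placeholder data with the Wine dataset's shape: 178 samples, 13 features.
rng = np.random.RandomState(0)
X = rng.rand(178, 13)
y = rng.randint(1, 4, size=178)

def simple_split(X, y, test_size=0.3, seed=0):
    """Shuffle the row indices once and cut them into test and training parts."""
    n_samples = X.shape[0]
    n_test = int(np.ceil(test_size * n_samples))
    order = np.random.RandomState(seed).permutation(n_samples)
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X_train, X_test, y_train, y_test = simple_split(X, y)
print(X_train.shape, X_test.shape)
```

With 178 samples and a 30 percent test share, this yields 124 training and 54 test samples, and every sample ends up in exactly one of the two sets.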


Note 


If we are dividing a dataset into training and test datasets, we have to keep in mind that we are 
withholding valuable information that the learning algorithm could benefit from. Thus, we don't want 
to allocate too much information to the test set. However, the smaller the test set, the more inaccurate 
the estimation of the generalization error. Dividing a dataset into training and test sets is all about 
balancing this trade-off. In practice, the most commonly used splits are 60:40, 70:30, or 80:20, 
depending on the size of the initial dataset. However, for large datasets, 90:10 or 99:1 splits into 
training and test subsets are also common and appropriate. Instead of discarding the allocated test 
data after model training and evaluation, it is a good idea to retrain a classifier on the entire dataset
for optimal performance.


Bringing features onto the same scale


Feature scaling is a crucial step in our preprocessing pipeline that can easily be forgotten. Decision
trees and random forests are among the very few machine learning algorithms where we don't need to
worry about feature scaling. However, the majority of machine learning and optimization algorithms
behave much better if features are on the same scale, as we saw in Chapter 2, Training Machine
Learning Algorithms for Classification, when we implemented the gradient descent optimization
algorithm.


The importance of feature scaling can be illustrated by a simple example. Let's assume that we have
two features where one feature is measured on a scale from 1 to 10 and the second feature is
measured on a scale from 1 to 100,000. When we think of the squared error function in Adaline in
Chapter 2, Training Machine Learning Algorithms for Classification, it is intuitive to say that the
algorithm will mostly be busy optimizing the weights according to the larger errors in the second
feature. Another example is the k-nearest neighbors (KNN) algorithm with a Euclidean distance
measure; the computed distances between samples will be dominated by the second feature axis.
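
The KNN case is easy to check numerically. The two samples below are made up for illustration: they differ by almost the whole range of the first feature but by only 1 percent of the range of the second one.

```python
import numpy as np

# Two hypothetical samples: feature 1 lies in [1, 10], feature 2 in [1, 100000].
a = np.array([1.0, 20000.0])
b = np.array([10.0, 21000.0])

squared_diffs = (a - b) ** 2
distance = np.sqrt(squared_diffs.sum())

# Share of the squared Euclidean distance contributed by each feature.
share = squared_diffs / squared_diffs.sum()

print(distance)
print(share)
```

Nearly all of the squared distance comes from the second feature, so the first feature has practically no influence on which neighbors are considered close.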


Now, there are two common approaches to bringing different features onto the same scale: 
normalization and standardization. Those terms are often used quite loosely in different fields, and 
the meaning has to be derived from the context. Most often, normalization refers to the rescaling of the 
features to a range of [0, 1], which is a special case of min-max scaling. To normalize our data, we
can simply apply the min-max scaling to each feature column, where the new value $x_{norm}^{(i)}$ of a
sample $x^{(i)}$ can be calculated as follows:


$$x_{norm}^{(i)} = \frac{x^{(i)} - x_{min}}{x_{max} - x_{min}}$$


Here, $x^{(i)}$ is a particular sample, $x_{min}$ is the smallest value in a feature column, and $x_{max}$ the largest
value, respectively.


The min-max scaling procedure is implemented in scikit-learn and can be used as follows:


>>> from sklearn.preprocessing import MinMaxScaler
>>> mms = MinMaxScaler()
>>> X_train_norm = mms.fit_transform(X_train)
>>> X_test_norm = mms.transform(X_test)


Although normalization via min-max scaling is a commonly used technique that is useful when we
need values in a bounded interval, standardization can be more practical for many machine learning
algorithms. The reason is that many linear models, such as the logistic regression and SVM that we
remember from Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, initialize the
weights to 0 or small random values close to 0. Using standardization, we center the feature columns
at mean 0 with standard deviation 1 so that the feature columns have the same mean and variance as a
standard normal distribution, which makes it easier to learn the weights. Note that standardization
does not change the shape of a feature's distribution. Furthermore, standardization maintains useful
information about outliers and makes the algorithm less sensitive to them in contrast to min-max
scaling, which scales the data to a limited range of values.


The procedure of standardization can be expressed by the following equation: 


$$x_{std}^{(i)} = \frac{x^{(i)} - \mu_x}{\sigma_x}$$


Here, $\mu_x$ is the sample mean of a particular feature column and $\sigma_x$ the corresponding standard
deviation, respectively.


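The following table illustrates the difference between the two commonly used feature scaling
techniques, standardization and normalization, on a simple sample dataset consisting of numbers 0 to 5:


input    standardized    normalized
0.0      -1.46385        0.0
1.0      -0.87831        0.2
2.0      -0.29277        0.4
3.0       0.29277        0.6
4.0       0.87831        0.8
5.0       1.46385        1.0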
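
Both formulas are short enough to apply by hand with NumPy. The following sketch scales the numbers 0 to 5 both ways; it assumes the population standard deviation (NumPy's default, ddof=0) in the standardization formula:

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5], dtype=float)

# Standardization: subtract the mean, divide by the standard deviation.
# np.std uses the population formula (ddof=0) by default.
x_std = (x - x.mean()) / x.std()

# Min-max normalization to the range [0, 1].
x_norm = (x - x.min()) / (x.max() - x.min())

for xi, s, n in zip(x, x_std, x_norm):
    print('%.1f  %9.5f  %.1f' % (xi, s, n))
```

The standardized column is centered at 0 and is not bounded to a fixed interval, whereas the normalized column always starts at 0 and ends at 1.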


Similar to MinMaxScaler, scikit-learn also implements a class for standardization: 





>>> from sklearn.preprocessing import StandardScaler
>>> stdsc = StandardScaler()
>>> X_train_std = stdsc.fit_transform(X_train)
>>> X_test_std = stdsc.transform(X_test)


Again, it is also important to highlight that we fit the StandardScaler only once on the training data
and use those parameters to transform the test set or any new data point.
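
A small sketch with made-up numbers shows why this matters. The test values below are deliberately much smaller than the training values; only the scaling that reuses the training parameters preserves that fact:

```python
import numpy as np

# Hypothetical single-feature training and test columns.
x_train = np.array([10.0, 20.0, 30.0])
x_test = np.array([5.0, 6.0, 7.0])

# Estimate the scaling parameters on the training data only.
mu, sigma = x_train.mean(), x_train.std()

# Reuse the training parameters for the test data.
x_test_std = (x_test - mu) / sigma

# Refitting on the test data would hide the shift between the two sets.
x_test_refit = (x_test - x_test.mean()) / x_test.std()

print(x_test_std)
print(x_test_refit)
```

After refitting, the test values look perfectly centered, although every one of them lies below the smallest training value; a model trained on the standardized training data would then see them on a different scale than the one it learned from.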


Selecting meaningful features


If we notice that a model performs much better on a training dataset than on the test dataset, this
observation is a strong indicator for overfitting. Overfitting means that the model fits the parameters
too closely to the particular observations in the training dataset but does not generalize well to real
data; we say that the model has a high variance. A reason for overfitting is that our model is too
complex for the given training data. Common solutions to reduce the generalization error are listed
as follows:


• Collect more training data
• Introduce a penalty for complexity via regularization
• Choose a simpler model with fewer parameters
• Reduce the dimensionality of the data


Collecting more training data is often not applicable. In the next chapter, we will learn about a useful
technique to check whether more training data is helpful at all. In the following sections and
subsections, we will look at common ways to reduce overfitting by regularization and dimensionality
reduction via feature selection.


Sparse solutions with L1 regularization


We recall from Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, that L2 
regularization is one approach to reduce the complexity of a model by penalizing large individual 
weights, where we defined the L2 norm of our weight vector w as follows: 


$$L2: \quad \|\mathbf{w}\|_2^2 = \sum_{j=1}^{m} w_j^2$$

Another approach to reduce the model complexity is the related L1 regularization: 


$$L1: \quad \|\mathbf{w}\|_1 = \sum_{j=1}^{m} |w_j|$$


Here, we simply replaced the square of the weights by the sum of the absolute values of the weights.
In contrast to L2 regularization, L1 regularization yields sparse feature vectors; most feature weights
will be zero. Sparsity can be useful in practice if we have a high-dimensional dataset with many
features that are irrelevant, especially cases where we have more irrelevant dimensions than samples.
In this sense, L1 regularization can be understood as a technique for feature selection.
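
For a concrete weight vector, the two penalty terms are one line of NumPy each. The vector w below is a made-up example:

```python
import numpy as np

# Hypothetical weight vector.
w = np.array([0.5, -2.0, 0.0, 1.5])

l2_penalty = np.sum(w ** 2)      # squared L2 norm: sum of squared weights
l1_penalty = np.sum(np.abs(w))   # L1 norm: sum of absolute weights

print(l2_penalty, l1_penalty)
```

The squared term lets the largest weight dominate the L2 penalty (4.0 out of 6.5 here), while each weight contributes to the L1 penalty in proportion to its magnitude.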


To better understand how L1 regularization encourages sparsity, let's take a step back and take a look
at a geometrical interpretation of regularization. Let's plot the contours of a convex cost function for
two weight coefficients $w_1$ and $w_2$. Here, we will consider the sum of the squared errors (SSE)
cost function that we used for Adaline in Chapter 2, Training Machine Learning Algorithms for
Classification, since it is symmetrical and easier to draw than the cost function of logistic regression;
however, the same concepts apply to the latter. Remember that our goal is to find the combination of
weight coefficients that minimize the cost function for the training data, as shown in the following
figure (the point in the middle of the ellipses):


[Figure: elliptical contours of the cost function over $w_1$ and $w_2$; the center of the ellipses is labeled "Minimize cost"]

Now, we can think of regularization as adding a penalty term to the cost function to encourage smaller
weights; or, in other words, we penalize large weights.


Thus, by increasing the regularization strength via the regularization parameter $\lambda$, we shrink the
weights towards zero and decrease the dependence of our model on the training data. Let's illustrate
this concept in the following figure for the L2 penalty term.


[Figure: cost contours over $w_1$ and $w_2$ together with the shaded L2 ball $\lambda \|\mathbf{w}\|_2^2$; labels: "Minimize cost", "Minimize penalty", "Minimize cost + penalty"]


The quadratic L2 regularization term is represented by the shaded ball. Here, our weight coefficients
cannot exceed our regularization budget; that is, the combination of the weight coefficients cannot
fall outside the shaded area. On the other hand, we still want to minimize the cost function. Under the
penalty constraint, our best effort is to choose the point where the L2 ball intersects with the contours
of the unpenalized cost function. The larger the value of the regularization parameter $\lambda$ gets, the
faster the penalized cost function grows, which leads to a narrower L2 ball. For example, if we
increase the regularization parameter towards infinity, the weight coefficients will become effectively
zero, denoted by the center of the L2 ball. To summarize the main message of the example: our goal is
to minimize the sum of the unpenalized cost function plus the penalty term, which can be understood
as adding bias and preferring a simpler model to reduce the variance in the absence of sufficient
training data to fit the model.
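
The shrinking effect can be reproduced numerically. The sketch below uses a linear model with a squared error cost on synthetic data, where the L2-penalized minimum has a closed-form solution; ridge_weights is a hypothetical helper, and the data and the "true" weights 3 and -2 are assumptions made for illustration:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(50, 2)                                   # two features: w1 and w2
y = X.dot(np.array([3.0, -2.0])) + 0.1 * rng.randn(50)

def ridge_weights(X, y, lam):
    """Minimize ||y - Xw||^2 + lam * ||w||^2 in closed form."""
    n_features = X.shape[1]
    A = X.T.dot(X) + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T.dot(y))

lams = [0.0, 1.0, 10.0, 100.0, 1e6]
norms = [np.linalg.norm(ridge_weights(X, y, lam)) for lam in lams]

for lam, norm in zip(lams, norms):
    print(lam, norm)
```

The length of the weight vector decreases steadily as the regularization parameter grows and is practically zero for the largest value, which corresponds to the center of the L2 ball.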


Now let's discuss L1 regularization and sparsity. The main concept behind L1 regularization is
similar to what we have discussed here. However, since the L1 penalty is the sum of the absolute
weight coefficients (remember that the L2 term is quadratic), we can represent it as a diamond shape
budget, as shown in the following figure:


[Figure: cost contours over $w_1$ and $w_2$ together with the L1 diamond $\lambda \|\mathbf{w}\|_1$; the point labeled "Minimize cost + penalty" lies on an axis at $w_1 = 0$]


In the preceding figure, we can see that the contour of the cost function touches the L1 diamond at
$w_1 = 0$. Since the contours of an L1 regularized system are sharp, it is more likely that the optimum
(that is, the intersection between the ellipses of the cost function and the boundary of the L1 diamond)
is located on the axes, which encourages sparsity. The mathematical details of why L1
regularization can lead to sparse solutions are beyond the scope of this book. If you are interested, an
excellent section on L2 versus L1 regularization can be found in section 3.4 of The Elements of
Statistical Learning, Trevor Hastie, Robert Tibshirani, and Jerome Friedman, Springer.
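
Without going into the general theory, a one-weight simplification already shows the difference. Assume the unpenalized cost for a single weight is 0.5 * (w - a)**2, where a is the unpenalized optimum; this is a simplifying assumption for illustration, not the logistic regression cost. Both penalized minimizers then have closed forms:

```python
import numpy as np

def l1_solution(a, lam):
    """Minimizer of 0.5 * (w - a)**2 + lam * |w| (soft thresholding)."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def l2_solution(a, lam):
    """Minimizer of 0.5 * (w - a)**2 + 0.5 * lam * w**2."""
    return a / (1.0 + lam)

# Hypothetical unpenalized optima for four independent weights.
a = np.array([3.0, -0.4, 0.2, -1.5])
lam = 0.5

print(l1_solution(a, lam))
print(l2_solution(a, lam))
```

The L1 penalty sets every weight whose unpenalized optimum is smaller in magnitude than lam to exactly zero, while the L2 penalty only rescales the weights and leaves all of them non-zero.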


For regularized models in scikit-learn that support L1 regularization, we can simply set the penalty
parameter to 'l1' to yield the sparse solution:


>>> from sklearn.linear_model import LogisticRegression
>>> LogisticRegression(penalty='l1')


Applied to the standardized Wine data, the L1 regularized logistic regression would yield the 
following sparse solution: 


>>> lr = LogisticRegression(penalty='l1', C=0.1)
>>> lr.fit(X_train_std, y_train)
>>> print('Training accuracy:', lr.score(X_train_std, y_train))
Training accuracy: 0.983870967742
>>> print('Test accuracy:', lr.score(X_test_std, y_test))
Test accuracy: 0.981481481481


Both training and test accuracies (both 98 percent) do not indicate any overfitting of our model. When 
we access the intercept terms via the lr.intercept_ attribute, we can see that the array returns three 
values: 


>>> lr.intercept_
array([-0.38379237, -0.1580855 , -0.70047966])


Since we fit the LogisticRegression object on a multiclass dataset, it uses the One-vs-Rest
(OvR) approach by default where the first intercept belongs to the model that fits class 1 versus class
2 and 3; the second value is the intercept of the model that fits class 2 versus class 1 and 3; and the
third value is the intercept of the model that fits class 3 versus class 1 and 2, respectively:


>>> lr.coef_
array([[ 0.280,  0.000,  0.000, -0.0282,  0.000,
         0.000,  0.710,  0.000,  0.000,  0.000,
         0.000,  0.000,  1.236],
       [-0.644, -0.0688, -0.0572,  0.000,  0.000,
         0.000,  0.000,  0.000,  0.000, -0.927,
         0.060,  0.000, -0.371],
       [ 0.000,  0.061,  0.000,  0.000,  0.000,
         0.000, -0.637,  0.000,  0.000,  0.499,
        -0.358, -0.570,  0.000]])


The weight array that we accessed via the lr.coef_ attribute contains three rows of weight
coefficients, one weight vector for each class. Each row consists of 13 weights where each weight is
multiplied by the respective feature in the 13-dimensional Wine dataset to calculate the net input:


$$z = w_1 x_1 + \dots + w_m x_m = \sum_{j=1}^{m} x_j w_j = \mathbf{w}^T \mathbf{x}$$
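
The net input of all three one-vs-rest models can be computed in a single matrix-vector product. The sketch below uses random placeholder arrays with the shapes of lr.coef_ and lr.intercept_, not the fitted values, and it adds the per-class intercept, which is stored separately from the 13 weights:

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical parameters with the shapes of lr.coef_ and lr.intercept_.
coef = rng.randn(3, 13)       # one weight vector per class
intercept = rng.randn(3)      # one intercept per class
x = rng.randn(13)             # one standardized sample

# Net input of each one-vs-rest model: z_k = w_k^T x + b_k.
z = coef.dot(x) + intercept

# The same computation written out as an explicit sum for the first class.
z0 = sum(coef[0, j] * x[j] for j in range(13)) + intercept[0]

# The class whose model produces the largest net input wins.
predicted_class = np.argmax(z) + 1   # classes are labeled 1, 2, 3
print(z, predicted_class)
```

Any weight that is exactly zero removes the corresponding feature from the sum, which is why a sparse weight vector amounts to a selection of features.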


We notice that the weight vectors are sparse, which means that they only have a few non-zero entries.
As a result of the L1 regularization, which serves as a method for feature selection, we just trained a
model that is robust to the potentially irrelevant features in this dataset.
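
The sparsity can also be counted programmatically. The following sketch is written for recent scikit-learn releases and makes several assumptions that differ from the listing above: it loads the copy of the Wine data bundled with scikit-learn, fits on all 178 samples instead of the training split, names a solver that supports the L1 penalty explicitly, and wraps the model in an explicit one-vs-rest classifier. Its numbers are therefore not expected to match the output shown earlier.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler

# Bundled copy of the Wine data (178 samples, 13 features).
X, y = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# liblinear is a solver that supports the L1 penalty; the explicit
# one-vs-rest wrapper trains one binary model per class.
ovr = OneVsRestClassifier(
    LogisticRegression(penalty='l1', C=0.1, solver='liblinear'))
ovr.fit(X_std, y)

# Stack the three weight vectors into one (3, 13) array.
coef = np.vstack([est.coef_ for est in ovr.estimators_])
zeros_per_class = (coef == 0).sum(axis=1)

print(coef.shape)
print(zeros_per_class)
```

Each entry of zeros_per_class is the number of features that the corresponding binary model ignores entirely.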


Lastly, let's plot the regularization path, which is the weight coefficients of the different features for
different regularization strengths:


>>> import matplotlib.pyplot as plt
>>> fig = plt.figure()
>>> ax = plt.subplot(111)
>>> colors = ['blue', 'green', 'red', 'cyan',
...           'magenta', 'yellow', 'black',
...           'pink', 'lightgreen', 'lightblue',
...           'gray', 'indigo', 'orange']
>>> weights, params = [], []
>>> for c in np.arange(-4, 6):
...     lr = LogisticRegression(penalty='l1',
...                             C=10.0**c,
...                             random_state=0)
...     lr.fit(X_train_std, y_train)
...     weights.append(lr.coef_[1])
...     params.append(10.0**c)
>>> weights = np.array(weights)
>>> for column, color in zip(range(weights.shape[1]), colors):
...     plt.plot(params, weights[:, column],
...              label=df_wine.columns[column+1],
...              color=color)
>>> plt.axhline(0, color='black', linestyle='--', linewidth=3)
>>> plt.xlim([10**(-5), 10**5])
>>> plt.ylabel('weight coefficient')
>>> plt.xlabel('C')
>>> plt.xscale('log')
>>> plt.legend(loc='upper left')
>>> ax.legend(loc='upper center',
...           bbox_to_anchor=(1.38, 1.03),
...           ncol=1, fancybox=True)
>>> plt.show()


The resulting plot provides us with further insights about the behavior of L1 regularization. As we can
see, all feature weights will be zero if we penalize the model with a strong regularization parameter
($C < 0.1$). C is the inverse of the regularization parameter $\lambda$.


[Figure: regularization path; weight coefficient (y axis) versus C on a log scale (x axis), with one line per feature: Alcohol, Malic acid, Ash, Alcalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, Proline]


Sequential feature selection algorithms 


An alternative way to reduce the complexity of the model and avoid overfitting is dimensionality 
reduction via feature selection, which is especially useful for unregularized models. There are two
main categories of dimensionality reduction techniques: feature selection and feature extraction. 
Using feature selection, we select a subset of the original features. In feature extraction, we derive 
information from the feature set to construct a new feature subspace. In this section, we will take a 
look at a classic family of feature selection algorithms. In the next chapter, Chapter 5, Compressing 
Data via Dimensionality Reduction, we will learn about different feature extraction techniques to 
compress a dataset onto a lower dimensional feature subspace. 


Sequential feature selection algorithms are a family of greedy search algorithms that are used to 
reduce an initial d-dimensional feature space to a k-dimensional feature subspace where k < d. The 
motivation behind feature selection algorithms is to automatically select a subset of features that are 
most relevant to the problem to improve computational efficiency or reduce the generalization error 
of the model by removing irrelevant features or noise, which can be useful for algorithms that don't 
support regularization. A classic sequential feature selection algorithm is Sequential Backward 
Selection (SBS), which aims to reduce the dimensionality of the initial feature subspace with a 
minimum decay in performance of the classifier to improve upon computational efficiency. In certain 
cases, SBS can even improve the predictive power of the model if a model suffers from overfitting.


Note 


Greedy algorithms make locally optimal choices at each stage of a combinatorial search problem and 
generally yield a suboptimal solution to the problem in contrast to exhaustive search algorithms,
which evaluate all possible combinations and are guaranteed to find the optimal solution. However, 
in practice, an exhaustive search is often computationally not feasible, whereas greedy algorithms 
allow for a less complex, computationally more efficient solution. 
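
The difference in cost is easy to quantify for the 13 features of the Wine dataset. The sketch below counts how many feature subsets have to be evaluated by an exhaustive search and by a greedy backward search that drops one feature per step; the target size k = 5 is an arbitrary choice for illustration:

```python
from math import comb

d = 13   # number of features in the Wine dataset
k = 5    # desired number of features (illustrative choice)

# Exhaustive search over all subsets of exactly k features.
exhaustive_fixed_k = comb(d, k)

# Exhaustive search over every non-empty subset.
exhaustive_all = 2 ** d - 1

# Greedy backward search: at a subset of size s it evaluates the s
# candidates that arise from dropping one feature, for s = d, ..., k + 1.
greedy = sum(range(k + 1, d + 1))

print(exhaustive_fixed_k, exhaustive_all, greedy)
```

The gap widens quickly with the number of features, because the exhaustive counts grow combinatorially while the greedy count grows only quadratically.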


The idea behind the SBS algorithm is quite simple: SBS sequentially removes features from the full
feature subset until the new feature subspace contains the desired number of features. In order to
determine which feature is to be removed at each stage, we need to define a criterion function J that
we want to minimize. The criterion calculated by the criterion function can simply be the difference in
performance of the classifier after and before the removal of a particular feature. Then the feature to
be removed at each stage can simply be defined as the feature that maximizes this criterion; or, in
more intuitive terms, at each stage we eliminate the feature that causes the least performance loss
after removal. Based on the preceding definition of SBS, we can outline the algorithm in 4 simple
steps:


1. Initialize the algorithm with $k = d$, where $d$ is the dimensionality of the full feature space $X_d$.
2. Determine the feature $x^-$ that maximizes the criterion $x^- = \arg\max J(X_k - x)$, where $x \in X_k$.
3. Remove the feature $x^-$ from the feature set: $X_{k-1} := X_k - x^-$; $k := k - 1$.
4. Terminate if $k$ equals the number of desired features; if not, go to step 2.
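
The four steps can be sketched in a few lines of plain Python. The function sbs_sketch and the toy criterion are hypothetical and serve only to trace the steps; in a real application the criterion would be based on the performance of a classifier:

```python
def sbs_sketch(features, criterion, k_desired):
    """Greedy backward selection following the four steps above."""
    current = tuple(features)              # step 1: start with all d features
    while len(current) > k_desired:        # step 4: stop at the desired size
        # Step 2: build every subset that results from removing one feature
        # and pick the one with the highest criterion value.
        candidates = [tuple(f for f in current if f != x) for x in current]
        best = max(candidates, key=criterion)
        current = best                     # step 3: remove that feature
    return current

# Toy criterion (an assumption for illustration): each feature has a fixed
# usefulness, and a subset is scored by the sum over its members.
usefulness = {'a': 0.9, 'b': 0.1, 'c': 0.5, 'd': 0.3}

def toy_criterion(subset):
    return sum(usefulness[f] for f in subset)

print(sbs_sketch(usefulness, toy_criterion, k_desired=2))
```

With this criterion, the search first drops 'b' and then 'd', the two least useful features, and returns ('a', 'c').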


Note 


You can find a detailed evaluation of several sequential feature algorithms in Comparative
Study of Techniques for Large Scale Feature Selection, F. Ferri, P. Pudil, M. Hatef, and J.
Kittler, Pattern Recognition in Practice IV, pages 403-413, 1994.


Unfortunately, the SBS algorithm is not implemented in scikit-learn yet. But since it is so simple, let's
go ahead and implement it in Python from scratch:


from sklearn.base import clone
from itertools import combinations
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.metrics import accuracy_score


class SBS():
    def __init__(self, estimator, k_features,
                 scoring=accuracy_score,
                 test_size=0.25, random_state=1):
        self.scoring = scoring
        self.estimator = clone(estimator)
        self.k_features = k_features
        self.test_size = test_size
        self.random_state = random_state

    def fit(self, X, y):
        # Hold out part of the data to score the candidate subsets.
        X_train, X_test, y_train, y_test = \
            train_test_split(X, y, test_size=self.test_size,
                             random_state=self.random_state)

        # Start with the full feature set and record its score.
        dim = X_train.shape[1]
        self.indices_ = tuple(range(dim))
        self.subsets_ = [self.indices_]
        score = self._calc_score(X_train, y_train,
                                 X_test, y_test, self.indices_)
        self.scores_ = [score]

        while dim > self.k_features:
            scores = []
            subsets = []

            # Evaluate every subset with one feature removed.
            for p in combinations(self.indices_, r=dim-1):
                score = self._calc_score(X_train, y_train,
                                         X_test, y_test, p)
                scores.append(score)
                subsets.append(p)

            # Keep the best-scoring subset.
            best = np.argmax(scores)
            self.indices_ = subsets[best]
            self.subsets_.append(self.indices_)
            dim -= 1

            self.scores_.append(scores[best])
        self.k_score_ = self.scores_[-1]

        return self

    def transform(self, X):
        return X[:, self.indices_]

    def _calc_score(self, X_train, y_train,
                    X_test, y_test, indices):
        self.estimator.fit(X_train[:, indices], y_train)
        y_pred = self.estimator.predict(X_test[:, indices])
        score = self.scoring(y_test, y_pred)
        return score


In the preceding implementation, we defined the k_features parameter to specify the desired
number of features we want to return. By default, we use the accuracy_score from scikit-learn to
evaluate the performance of a model and estimator for classification on the feature subsets. Inside the
while loop of the fit method, the feature subsets created by the itertools.combinations function
are evaluated and reduced until the feature subset has the desired dimensionality. In each iteration, the
accuracy score of the best subset is collected in a list self.scores_ based on the internally created
test dataset X_test. We will use those scores later to evaluate the results. The column indices of the
final feature subset are assigned to self.indices_, which we can use via the transform method to
return a new data array with the selected feature columns. Note that, instead of calculating the
criterion explicitly inside the fit method, we simply removed the feature that is not contained in the
best performing feature subset.
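
To see what one pass of the while loop iterates over, it helps to run itertools.combinations on a small index tuple. The four indices below are a stand-in for self.indices_, and the choice of the "best" subset is arbitrary:

```python
from itertools import combinations

indices = (0, 1, 2, 3)   # a small stand-in for self.indices_
dim = len(indices)

# All subsets that result from leaving out exactly one feature.
candidates = list(combinations(indices, r=dim - 1))
print(candidates)

# The removed feature is whichever index is absent from the chosen subset.
best_subset = candidates[1]   # pretend this subset scored best
removed = set(indices) - set(best_subset)
print(removed)
```

Each candidate omits exactly one index, so choosing the best-scoring candidate implicitly identifies the feature to remove.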


Now, let's see our SBS implementation in action using the KNN classifier from scikit-learn: 


>>> from sklearn.neighbors import KNeighborsClassifier
>>> import matplotlib.pyplot as plt
>>> knn = KNeighborsClassifier(n_neighbors=2)
>>> sbs = SBS(knn, k_features=1)
>>> sbs.fit(X_train_std, y_train)


Although our SBS implementation already splits the dataset into a test and training dataset inside the
fit function, we still fed the training dataset X_train to the algorithm. The SBS fit method will
then create new training subsets for testing (validation) and training, which is why this test set is also
called the validation dataset. This approach is necessary to prevent our original test set from becoming
part of the training data.


Remember that our SBS algorithm collects the scores of the best feature subset at each stage, so let's
move on to the more exciting part of our implementation and plot the classification accuracy of the
KNN classifier that was calculated on the validation dataset. The code is as follows:


>>> k_feat = [len(k) for k in sbs.subsets_]
>>> plt.plot(k_feat, sbs.scores_, marker='o')
>>> plt.ylim([0.7, 1.1])
>>> plt.ylabel('Accuracy')
>>> plt.xlabel('Number of features')
>>> plt.grid()
>>> plt.show()


As we can see in the following plot, the accuracy of the KNN classifier improved on the validation
dataset as we reduced the number of features, which is likely due to a decrease of the curse of
dimensionality that we discussed in the context of the KNN algorithm in Chapter 3, A Tour of
Machine Learning Classifiers Using Scikit-learn. Also, we can see in the following plot that the
classifier achieved 100 percent accuracy for k={5, 6, 7, 8, 9, 10}:


To satisfy our own curiosity, let's see what those five features are that yielded such a good
performance on the validation dataset:


>>> k5 = list(sbs.subsets_[8])
>>> print(df_wine.columns[1:][k5])
Index(['Alcohol', 'Malic acid', 'Alcalinity of ash', 'Hue', 'Proline'], dtype='object')


Using the preceding code, we obtained the column indices of the 5-feature subset from the 9th
position in the sbs.subsets_ attribute and returned the corresponding feature names from the
column index of the Wine DataFrame.


Next, let's evaluate the performance of the KNN classifier on the original test set:


>>> knn.fit(X_train_std, y_train)
>>> print('Training accuracy:', knn.score(X_train_std, y_train))
Training accuracy: 0.983870967742
>>> print('Test accuracy:', knn.score(X_test_std, y_test))
Test accuracy: 0.944444444444


In the preceding code, we used the complete feature set and obtained ~98.4 percent accuracy on the 
training dataset. However, the accuracy on the test dataset was slightly lower (~94.4 percent), which 
is an indicator of a slight degree of overfitting. Now let's use the selected 5-feature subset and see 
how well KNN performs: 


>>> knn.fit(X_train_std[:, k5], y_train)
>>> print('Training accuracy:',
...       knn.score(X_train_std[:, k5], y_train))
Training accuracy: 0.959677419355
>>> print('Test accuracy:',
...       knn.score(X_test_std[:, k5], y_test))
Test accuracy: 0.962962962963


Using fewer than half of the original features in the Wine dataset, the prediction accuracy on the test 
set improved by almost 2 percent. Also, we reduced overfitting, which we can tell from the small gap 
between test (~96.3 percent) and training (~96.0 percent) accuracy. 


Note 
Feature selection algorithms in scikit-learn 


There are many more feature selection algorithms available via scikit-learn. Those include recursive 
backward elimination based on feature weights, tree-based methods to select features by importance, 
and univariate statistical tests. A comprehensive discussion of the different feature selection methods 
is beyond the scope of this book, but a good summary with illustrative examples can be found at 
http://scikit-learn.org/stable/modules/feature_selection.html. 


Assessing feature importance with random forests


In the previous sections, you learned how to use L1 regularization to zero out irrelevant features via
logistic regression and use the SBS algorithm for feature selection. Another useful approach to select
relevant features from a dataset is to use a random forest, an ensemble technique that we introduced in
Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn. Using a random forest, we
can measure feature importance as the averaged impurity decrease computed from all decision trees
in the forest without making any assumptions whether our data is linearly separable or not.
Conveniently, the random forest implementation in scikit-learn already collects feature importances
for us so that we can access them via the feature_importances_ attribute after fitting a
RandomForestClassifier. By executing the following code, we will now train a forest of 10,000
trees on the Wine dataset and rank the 13 features by their respective importance measures.
Remember (from our discussion in Chapter 3, A Tour of Machine Learning Classifiers Using
Scikit-learn) that we don't need to standardize or normalize the features for tree-based models. The
code is as follows:


>>> from sklearn.ensemble import RandomForestClassifier
>>> feat_labels = df_wine.columns[1:]
>>> forest = RandomForestClassifier(n_estimators=10000,
...                                 random_state=0,
...                                 n_jobs=-1)
>>> forest.fit(X_train, y_train)
>>> importances = forest.feature_importances_
>>> indices = np.argsort(importances)[::-1]
>>> for f in range(X_train.shape[1]):
...     print("%2d) %-*s %f" % (f + 1, 30,
...                             feat_labels[f],
...                             importances[indices[f]]))
 1) Alcohol                        0.182508
 2) Malic acid                     0.158574
 3) Ash                            0.150954
 4) Alcalinity of ash              0.131983
 5) Magnesium                      0.106564
 6) Total phenols                  0.078249
 7) Flavanoids                     0.060717
 8) Nonflavanoid phenols           0.032039
 9) Proanthocyanins                0.025385
10) Color intensity                0.022369
11) Hue                            0.022070
12) OD280/OD315 of diluted wines   0.014655
13) Proline                        0.013933


>>> plt.title('Feature Importances')
>>> plt.bar(range(X_train.shape[1]),
...         importances[indices],
...         color='lightblue',
...         align='center')
>>> plt.xticks(range(X_train.shape[1]),
...            feat_labels, rotation=90)
>>> plt.xlim([-1, X_train.shape[1]])
>>> plt.tight_layout()
>>> plt.show()


After executing the preceding code, we created a plot that ranks the different features in the Wine
dataset by their relative importance; note that the feature importances are normalized so that they sum
up to 1.0.
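
A minimal sketch of the same ranking idea, assuming the copy of the Wine data that ships with
scikit-learn (load_wine) and a smaller forest of 500 trees to keep the run short; both are choices made
for this example. It looks up each label with the same sorted index as its importance value, so that
names and values stay paired, and it checks that the importances sum to 1.0.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

wine = load_wine()
X, y = wine.data, wine.target
feat_labels = np.array(wine.feature_names)

# 500 trees is an arbitrary choice for this sketch
forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X, y)

importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]   # feature indices, most important first

for rank, idx in enumerate(indices, start=1):
    # label and importance are both looked up with the same index
    print('%2d) %-30s %f' % (rank, feat_labels[idx], importances[idx]))

print('sum of importances:', importances.sum())
```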


We can conclude that the alcohol content of wine is the most discriminative feature in the dataset
based on the average impurity decrease in the 10,000 decision trees. Interestingly, the three
top-ranked features in the preceding plot are also among the top five features in the selection by the
SBS algorithm that we implemented in the previous section. However, as far as interpretability is
concerned, the random forest technique comes with an important gotcha that is worth mentioning. For
instance, if two or more features are highly correlated, one feature may be ranked very highly while
the information of the other feature(s) may not be fully captured. On the other hand, we don't need to
be concerned about this problem if we are merely interested in the predictive performance of a model
rather than the interpretation of feature importances. To conclude this section about feature
importances and random forests, it is worth mentioning that scikit-learn also implements a transform
method that selects features based on a user-specified threshold after model fitting. This is useful if
we want to use the RandomForestClassifier as a feature selector and intermediate step in a
scikit-learn pipeline, which allows us to connect different preprocessing steps with an estimator, as
we will see in Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning.
For example, we could set the threshold to 0.15 to reduce the dataset to the 3 most important features,
Alcohol, Malic acid, and Ash, using the following code:


>>> X_selected = forest.transform(X_train, threshold=0.15)
>>> X_selected.shape
(124, 3)
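
In more recent scikit-learn releases, threshold-based selection of this kind is provided by the
SelectFromModel meta-transformer rather than by a transform method on the forest itself. The sketch
below assumes such a release and, to stay self-contained, fits a forest of 500 trees on the bundled
load_wine data; the mean-importance threshold is likewise an assumption made for illustration.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_wine(return_X_y=True)

forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X, y)

# Keep every feature whose importance is at least the mean importance;
# prefit=True reuses the already fitted forest instead of refitting it
threshold = forest.feature_importances_.mean()
sfm = SelectFromModel(forest, threshold=threshold, prefit=True)
X_selected = sfm.transform(X)

n_kept = int(np.sum(forest.feature_importances_ >= threshold))
print(X_selected.shape, n_kept)
```

Because SelectFromModel is itself a transformer, it can be placed in a pipeline in front of a final
estimator in the same way as any other preprocessing step.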
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Summary 


We started this chapter by looking at useful techniques to make sure that we handle missing data
correctly. Before we feed data to a machine learning algorithm, we also have to make sure that we
encode categorical variables correctly, and we have seen how we can map ordinal and nominal
feature values to integer representations.


Moreover, we briefly discussed L1 regularization, which can help us to avoid overfitting by reducing
the complexity of a model. As an alternative approach for removing irrelevant features, we used a
sequential feature selection algorithm to select meaningful features from a dataset.


In the next chapter, you will learn about yet another useful approach to dimensionality reduction:
feature extraction. It allows us to compress features onto a lower dimensional subspace rather than
removing features entirely as in feature selection.
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Chapter 5. Compressing Data via 
Dimensionality Reduction 


In Chapter 4, Building Good Training Sets — Data Preprocessing, you learned about the different 
approaches for reducing the dimensionality of a dataset using different feature selection techniques. 
An alternative approach to feature selection for dimensionality reduction is feature extraction. In this
chapter, you will learn about three fundamental techniques that will help us to summarize the
information content of a dataset by transforming it onto a new feature subspace of lower
dimensionality than the original one. Data compression is an important topic in machine learning, and
it helps us to store and analyze the increasing amounts of data that are produced and collected in the
modern age of technology. In this chapter, we will cover the following topics:


• Principal component analysis (PCA) for unsupervised data compression

• Linear Discriminant Analysis (LDA) as a supervised dimensionality reduction technique for
  maximizing class separability

• Nonlinear dimensionality reduction via kernel principal component analysis
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Unsupervised dimensionality reduction via 
principal component analysis 


Similar to feature selection, we can use feature extraction to reduce the number of features in a
dataset. However, while we maintained the original features when we used feature selection 
algorithms, such as sequential backward selection, we use feature extraction to transform or project 
the data onto a new feature space. In the context of dimensionality reduction, feature extraction can be 
understood as an approach to data compression with the goal of maintaining most of the relevant 
information. Feature extraction is typically used to improve computational efficiency but can also 
help to reduce the curse of dimensionality—especially if we are working with nonregularized 
models. 


Principal component analysis (PCA) is an unsupervised linear transformation technique that is
widely used across different fields, most prominently for dimensionality reduction. Other popular
applications of PCA include exploratory data analyses and de-noising of signals in stock market
trading, and the analysis of genome data and gene expression levels in the field of bioinformatics. PCA
helps us to identify patterns in data based on the correlation between features. In a nutshell, PCA aims
to find the directions of maximum variance in high-dimensional data and projects it onto a new
subspace with equal or fewer dimensions than the original one. The orthogonal axes (principal
components) of the new subspace can be interpreted as the directions of maximum variance given the
constraint that the new feature axes are orthogonal to each other as illustrated in the following figure.


Here, $x_1$ and $x_2$ are the original feature axes, and PC1 and PC2 are the principal components.


If we use PCA for dimensionality reduction, we construct a $d \times k$-dimensional transformation
matrix $W$ that allows us to map a sample vector $x$ onto a new $k$-dimensional feature subspace that
has fewer dimensions than the original $d$-dimensional feature space:


$$x = [x_1, x_2, \dots, x_d], \quad x \in \mathbb{R}^d$$

$$\downarrow \; xW, \quad W \in \mathbb{R}^{d \times k}$$

$$z = [z_1, z_2, \dots, z_k], \quad z \in \mathbb{R}^k$$


As a result of transforming the original $d$-dimensional data onto this new $k$-dimensional subspace
(typically $k \ll d$), the first principal component will have the largest possible variance, and all
subsequent principal components will have the largest possible variance given that they are
uncorrelated (orthogonal) to the other principal components. Note that the PCA directions are highly
sensitive to data scaling, and we need to standardize the features prior to PCA if the features were
measured on different scales and we want to assign equal importance to all features.
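
The sensitivity to scaling can be seen in a small NumPy experiment. The sketch below uses two
independent synthetic features (an assumption made purely for illustration), one of which is measured
on a scale that is 100 times larger, and compares the leading principal direction before and after
standardization.

```python
import numpy as np

rng = np.random.RandomState(0)
# Two independent features; the second is measured on a 100x larger scale
X = np.column_stack([rng.normal(0.0, 1.0, 500),
                     rng.normal(0.0, 100.0, 500)])

def first_pc(data):
    """Unit eigenvector of the covariance matrix with the largest eigenvalue."""
    eig_vals, eig_vecs = np.linalg.eigh(np.cov(data.T))
    return eig_vecs[:, np.argmax(eig_vals)]

pc_raw = first_pc(X)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
pc_std = first_pc(X_std)

print('raw data:         ', np.abs(pc_raw))
print('standardized data:', np.abs(pc_std))
```

On the raw data, the first direction points almost entirely along the large-scale feature; after
standardization, both features enter the leading direction with equal weight.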


Before looking at the PCA algorithm for dimensionality reduction in more detail, let's summarize the
approach in a few simple steps:


1. Standardize the $d$-dimensional dataset.
2. Construct the covariance matrix.
3. Decompose the covariance matrix into its eigenvectors and eigenvalues.
4. Select $k$ eigenvectors that correspond to the $k$ largest eigenvalues, where $k$ is the
   dimensionality of the new feature subspace ($k \le d$).
5. Construct a projection matrix $W$ from the "top" $k$ eigenvectors.
6. Transform the $d$-dimensional input dataset $X$ using the projection matrix $W$ to obtain the new
   $k$-dimensional feature subspace.
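
The six steps above can be sketched compactly in NumPy. The function name pca_project and the random
input data are hypothetical and serve only as an illustration; the sketch also uses numpy.linalg.eigh,
which is one possible eigensolver for symmetric matrices.

```python
import numpy as np

def pca_project(X, k):
    """Project X (n_samples x d) onto its top-k principal components."""
    # 1. Standardize each feature to zero mean and unit variance
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix (d x d); np.cov expects features in rows
    cov_mat = np.cov(X_std.T)
    # 3. Eigendecomposition of the symmetric covariance matrix
    eig_vals, eig_vecs = np.linalg.eigh(cov_mat)
    # 4. Indices of the k largest eigenvalues
    order = np.argsort(eig_vals)[::-1][:k]
    # 5. Projection matrix W (d x k) from the top-k eigenvectors
    W = eig_vecs[:, order]
    # 6. Transform the data onto the k-dimensional subspace
    return X_std.dot(W), W

rng = np.random.RandomState(1)
X_demo = rng.normal(size=(100, 5))   # hypothetical data: 100 samples, 5 features
Z, W = pca_project(X_demo, k=2)
print(Z.shape, W.shape)
```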
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Total and explained variance 


In this subsection, we will tackle the first four steps of a principal component analysis: standardizing 
the data, constructing the covariance matrix, obtaining the eigenvalues and eigenvectors of the 
covariance matrix, and sorting the eigenvalues by decreasing order to rank the eigenvectors. 


First, we will start by loading the Wine dataset that we have been working with in Chapter 4, 
Building Good Training Sets — Data Preprocessing: 


>>> import pandas as pd
>>> df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/'
...                       'machine-learning-databases/wine/wine.data',
...                       header=None)


Next, we will process the Wine data into separate training and test sets—using 70 percent and 30 
percent of the data, respectively—and standardize it to unit variance. 


>>> # in scikit-learn versions before 0.18, import from sklearn.cross_validation
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.preprocessing import StandardScaler
>>> X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
>>> X_train, X_test, y_train, y_test = \
...     train_test_split(X, y,
...                      test_size=0.3, random_state=0)
>>> sc = StandardScaler()
>>> X_train_std = sc.fit_transform(X_train)
>>> X_test_std = sc.transform(X_test)


After completing the mandatory preprocessing steps by executing the preceding code, let's advance to
the second step: constructing the covariance matrix. The symmetric $d \times d$-dimensional covariance
matrix, where $d$ is the number of dimensions in the dataset, stores the pairwise covariances between
the different features. For example, the covariance between two features $x_j$ and $x_k$ on the
population level can be calculated via the following equation:


$$\sigma_{jk} = \frac{1}{n} \sum_{i=1}^{n} \left( x_j^{(i)} - \mu_j \right) \left( x_k^{(i)} - \mu_k \right)$$


Here, $\mu_j$ and $\mu_k$ are the sample means of feature $j$ and $k$, respectively. Note that the
sample means are zero if we standardize the dataset. A positive covariance between two features
indicates that the features increase or decrease together, whereas a negative covariance indicates that
the features vary in opposite directions. For example, a covariance matrix of three features can then
be written as (note that $\Sigma$ stands for the Greek letter sigma, which is not to be confused with
the sum symbol):


$$\Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_2^2 & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_3^2 \end{bmatrix}$$
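
To connect the covariance formula with NumPy, the following sketch computes all pairwise covariances
of a small random dataset (hypothetical data, used only for illustration) directly from the definition
and compares the result with numpy.cov. Note that numpy.cov divides by n - 1 by default, whereas the
population-level formula divides by n.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))          # hypothetical data: n=50 samples, d=3 features
n = X.shape[0]

mu = X.mean(axis=0)                   # per-feature means
X_centered = X - mu

# sigma_jk = (1/n) * sum_i (x_j^(i) - mu_j) * (x_k^(i) - mu_k), for all j, k at once
cov_manual = X_centered.T.dot(X_centered) / n

# np.cov divides by n - 1 unless bias=True is passed
cov_numpy_population = np.cov(X.T, bias=True)
cov_numpy_sample = np.cov(X.T)

print(np.allclose(cov_manual, cov_numpy_population))
print(np.allclose(cov_manual * n / (n - 1), cov_numpy_sample))
```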
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The eigenvectors of the covariance matrix represent the principal components (the directions of 
maximum variance), whereas the corresponding eigenvalues will define their magnitude. In the case
of the Wine dataset, we would obtain 13 eigenvectors and eigenvalues from the $13 \times 13$-dimensional
covariance matrix.


Now, let's obtain the eigenpairs of the covariance matrix. As we surely remember from our
introductory linear algebra or calculus classes, an eigenvector $v$ satisfies the following condition:


$$\Sigma v = \lambda v$$


Here, $\lambda$ is a scalar: the eigenvalue. Since the manual computation of eigenvectors and
eigenvalues is a somewhat tedious and elaborate task, we will use the linalg.eig function from NumPy to
obtain the eigenpairs of the Wine covariance matrix:


>>> import numpy as np
>>> cov_mat = np.cov(X_train_std.T)
>>> eigen_vals, eigen_vecs = np.linalg.eig(cov_mat)
>>> print('\nEigenvalues \n%s' % eigen_vals)

Eigenvalues
[ 4.8923083   2.46635032  1.42809973  1.01233462  0.84906459  0.60181514
  0.52251546  0.08414846  0.33051429  0.29595018  0.16831254  0.21432212
  0.2399553 ]


Using the numpy.cov function, we computed the covariance matrix of the standardized training
dataset. Using the linalg.eig function, we performed the eigendecomposition that yielded a vector
(eigen_vals) consisting of 13 eigenvalues and the corresponding eigenvectors stored as columns in
a $13 \times 13$-dimensional matrix (eigen_vecs).
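
Because a covariance matrix is symmetric, NumPy's numpy.linalg.eigh can be used as an alternative to
numpy.linalg.eig; it returns real eigenvalues in ascending order. The sketch below compares the two
solvers on a small random matrix that stands in for the covariance matrix (an assumption made for this
example).

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.normal(size=(40, 4))
cov_mat = np.cov(A.T)                           # symmetric 4x4 stand-in matrix

vals_eig, vecs_eig = np.linalg.eig(cov_mat)     # general solver, no ordering guarantee
vals_eigh, vecs_eigh = np.linalg.eigh(cov_mat)  # symmetric solver, ascending order

# Both solvers find the same eigenvalues
print(np.allclose(np.sort(vals_eig.real), vals_eigh))

# Each column v of vecs_eigh satisfies cov_mat v = lambda v
print(np.allclose(cov_mat.dot(vecs_eigh), vecs_eigh * vals_eigh))
```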


Since we want to reduce the dimensionality of our dataset by compressing it onto a new feature
subspace, we only select the subset of the eigenvectors (principal components) that contains most of
the information (variance). Since the eigenvalues define the magnitude of the eigenvectors, we have to
sort the eigenvalues by decreasing magnitude; we are interested in the top $k$ eigenvectors based on
the values of their corresponding eigenvalues. But before we collect those $k$ most informative
eigenvectors, let's plot the variance explained ratios of the eigenvalues.


The variance explained ratio of an eigenvalue $\lambda_j$ is simply the fraction of an eigenvalue
$\lambda_j$ and the total sum of the eigenvalues:


$$\frac{\lambda_j}{\sum_{j=1}^{d} \lambda_j}$$
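
As a small worked example of this ratio, the sketch below uses the two largest eigenvalues from the
printed output (4.8923083 and 2.46635032). It relies on the fact that the eigenvalues of a covariance
matrix sum to its trace; assuming the features were standardized with the population standard
deviation and the covariance was computed with the n - 1 denominator (the NumPy default), that trace
is d * n / (n - 1) for d = 13 features and n = 124 training samples.

```python
import numpy as np

d, n = 13, 124                             # features and training samples
lam = np.array([4.8923083, 2.46635032])    # two largest eigenvalues from the output

# Each standardized feature has variance n / (n - 1) under np.cov,
# so the eigenvalues sum to the trace d * n / (n - 1)
tot = d * n / (n - 1)

var_exp = lam / tot
print('individual ratios:', var_exp)
print('first two combined:', var_exp.sum())
```

The first component therefore explains roughly 37 percent of the variance and the first two together
roughly 56 percent.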


Using the NumPy cumsum function, we can then calculate the cumulative sum of explained variances, 
which we will plot via matplotlib's step function: 


>>> tot = sum(eigen_vals)
>>> var_exp = [(i / tot) for i in
...            sorted(eigen_vals, reverse=True)]
>>> cum_var_exp = np.cumsum(var_exp)
>>> import matplotlib.pyplot as plt
>>> plt.bar(range(1, 14), var_exp, alpha=0.5, align='center',
...         label='individual explained variance')
>>> plt.step(range(1, 14), cum_var_exp, where='mid',
...          label='cumulative explained variance')
>>> plt.ylabel('Explained variance ratio')
>>> plt.xlabel('Principal components')
>>> plt.legend(loc='best')
>>> plt.show()


The resulting plot indicates that the first principal component alone accounts for 40 percent of the
variance. Also, we can see that the first two principal components combined explain almost 60
percent of the variance in the data:

[Figure: bar chart of the individual explained variance with a step plot of the cumulative explained
variance; x-axis: Principal components, y-axis: Explained variance ratio]


Although the explained variance plot reminds us of the feature importance that we computed in 
Chapter 4, Building Good Training Sets — Data Preprocessing, via random forests, we shall remind 
ourselves that PCA is an unsupervised method, which means that information about the class labels is
ignored. Whereas a random forest uses the class membership information to compute the node 
impurities, variance measures the spread of values along a feature axis. 
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Feature transformation 


After we have successfully decomposed the covariance matrix into eigenpairs, let's now proceed 
with the last three steps to transform the Wine dataset onto the new principal component axes. In this 
section, we will sort the eigenpairs by descending order of the eigenvalues, construct a projection 
matrix from the selected eigenvectors, and use the projection matrix to transform the data onto the 
lower-dimensional subspace. 


We start by sorting the eigenpairs by decreasing order of the eigenvalues: 


>>> eigen_pairs = [(np.abs(eigen_vals[i]), eigen_vecs[:, i])
...                for i in range(len(eigen_vals))]
>>> # sort on the eigenvalue only, so that tied values never compare the arrays
>>> eigen_pairs.sort(key=lambda k: k[0], reverse=True)


Next, we collect the two eigenvectors that correspond to the two largest values to capture about 60 
percent of the variance in this dataset. Note that we only chose two eigenvectors for the purpose of 
illustration, since we are going to plot the data via a two-dimensional scatter plot later in this
subsection. In practice, the number of principal components has to be determined from a trade-off 
between computational efficiency and the performance of the classifier: 


>>> w = np.hstack((eigen_pairs[0][1][:, np.newaxis],
...                eigen_pairs[1][1][:, np.newaxis]))
>>> print('Matrix W:\n', w)
Matrix W:
[[ 0.14669811  0.50417079]
 [-0.24224554  0.24216889]
 [-0.02993442  0.28698484]
 [-0.25519002 -0.06468718]
 [ 0.12079772  0.22995385]
 [ 0.38934455  0.09363991]
 [ 0.42326486  0.01088622]
 [-0.30634956  0.01870216]
 [ 0.30572219  0.03040352]
 [-0.09869191  0.54527081]
 [ 0.30032535 -0.27924322]
 [ 0.36821154 -0.174365  ]
 [ 0.29259713  0.36315461]]


By executing the preceding code, we have created a $13 \times 2$-dimensional projection matrix $W$ from
the top two eigenvectors. Using the projection matrix, we can now transform a sample $x$
(represented as a $1 \times 13$-dimensional row vector) onto the PCA subspace, obtaining $x'$, a now
two-dimensional sample vector consisting of two new features:


$$x' = xW$$


>>> X_train_std[0].dot(w)
array([ 2.59891628,  0.00484089])


Similarly, we can transform the entire $124 \times 13$-dimensional training dataset onto the two
principal components by calculating the matrix dot product:


$$X' = XW$$


>>> X_train_pca = X_train_std.dot(w)


Lastly, let's visualize the transformed Wine training set, now stored as a $124 \times 2$-dimensional
matrix, in a two-dimensional scatterplot:


>>> colors = ['r', 'b', 'g']
>>> markers = ['s', 'x', 'o']
>>> for l, c, m in zip(np.unique(y_train), colors, markers):
...     plt.scatter(X_train_pca[y_train==l, 0],
...                 X_train_pca[y_train==l, 1],
...                 c=c, label=l, marker=m)
>>> plt.xlabel('PC 1')
>>> plt.ylabel('PC 2')
>>> plt.legend(loc='lower left')
>>> plt.show()


As we can see in the resulting plot (shown in the next figure), the data is more spread along the
x-axis (the first principal component) than along the y-axis (the second principal component), which
is consistent with the explained variance ratio plot that we created in the previous subsection.
However, we can intuitively see that a linear classifier will likely be able to separate the classes
well:

Although we encoded the class labels information for the purpose of illustration in the preceding
scatter plot, we have to keep in mind that PCA is an unsupervised technique that doesn't use class
label information.
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Principal component analysis in scikit-learn 


Although the verbose approach in the previous subsection helped us to follow the inner workings of
PCA, we will now discuss how to use the PCA class implemented in scikit-learn. PCA is another one
of scikit-learn's transformer classes, where we first fit the model using the training data before we
transform both the training data and the test data using the same model parameters. Now, let's use the
PCA class from scikit-learn on the Wine training dataset, classify the transformed samples via logistic
regression, and visualize the decision regions via the plot_decision_regions function that we
defined in Chapter 2, Training Machine Learning Algorithms for Classification:


from matplotlib.colors import ListedColormap

def plot_decision_regions(X, y, classifier, resolution=0.02):

    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # plot class samples
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=cmap(idx),
                    marker=markers[idx], label=cl)


>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=2)
>>> lr = LogisticRegression()
>>> X_train_pca = pca.fit_transform(X_train_std)
>>> X_test_pca = pca.transform(X_test_std)
>>> lr.fit(X_train_pca, y_train)
>>> plot_decision_regions(X_train_pca, y_train, classifier=lr)
>>> plt.xlabel('PC1')
>>> plt.ylabel('PC2')
>>> plt.legend(loc='lower left')
>>> plt.show()


By executing the preceding code, we should now see the decision regions for the training model
reduced to the two principal component axes.

If we compare the PCA projection via scikit-learn with our own PCA implementation, we notice that
the plot above is a mirror image of the previous PCA via our step-by-step approach. Note that this is
not due to an error in either of those two implementations; the reason for this difference is that,
depending on the eigensolver, eigenvectors can have either negative or positive signs. Not that it
matters, but we could simply revert the mirror image by multiplying the data by -1 if we wanted to;
note that eigenvectors are typically scaled to unit length 1. For the sake of completeness, let's plot
the decision regions of the logistic regression on the transformed test dataset to see if it can
separate the classes well:


>>> plot_decision_regions(X_test_pca, y_test, classifier=lr)
>>> plt.xlabel('PC1')
>>> plt.ylabel('PC2')
>>> plt.legend(loc='lower left')
>>> plt.show()


After we plot the decision regions for the test set by executing the preceding code, we can see that
logistic regression performs quite well on this small two-dimensional feature subspace and only
misclassifies one sample in the test dataset.

If we are interested in the explained variance ratios of the different principal components, we can
simply initialize the PCA class with the n_components parameter set to None, so all principal
components are kept and the explained variance ratio can then be accessed via the
explained_variance_ratio_ attribute:


>>> pca = PCA(n_components=None)
>>> X_train_pca = pca.fit_transform(X_train_std)
>>> pca.explained_variance_ratio_
array([ 0.37329648,  0.18818926,  0.10896791,  0.07724389,  0.06478595,
        0.04592014,  0.03986936,  0.02521914,  0.02258181,  0.01830924,
        0.01635336,  0.01284271,  0.00642076])


Note that we set n_components=None when we initialized the PCA class so that it would return all 
principal components in sorted order instead of performing a dimensionality reduction. 
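
The relationship between scikit-learn's PCA class and a step-by-step eigendecomposition can be checked
on a small synthetic dataset (the data below is hypothetical and only serves the comparison). The
sketch confirms that both routes yield the same explained variance ratios and the same directions, up
to a possible sign flip per component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Hypothetical data: 200 samples, 4 features with different variances
X = rng.normal(size=(200, 4)) * np.array([3.0, 2.0, 1.0, 0.5])

# Manual route: eigendecomposition of the covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(np.cov(X.T))
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# scikit-learn route; rows of components_ are the principal axes
pca = PCA(n_components=None).fit(X)

# Same explained variance ratios ...
same_ratios = np.allclose(pca.explained_variance_ratio_,
                          eig_vals / eig_vals.sum())
# ... and the same directions, ignoring the sign of each component
same_axes = np.allclose(np.abs(pca.components_), np.abs(eig_vecs.T))
print(same_ratios, same_axes)
```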
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Supervised data compression via linear 
discriminant analysis 


Linear Discriminant Analysis (LDA) can be used as a technique for feature extraction to increase 
the computational efficiency and reduce the degree of over-fitting due to the curse of dimensionality 
in nonregularized models. 


The general concept behind LDA is very similar to PCA. Whereas PCA attempts to find the
orthogonal component axes of maximum variance in a dataset, the goal in LDA is to find the feature
subspace that optimizes class separability. Both PCA and LDA are linear transformation techniques
that can be used to reduce the number of dimensions in a dataset; the former is an unsupervised
algorithm, whereas the latter is supervised. Thus, we might intuitively think that LDA is a superior
feature extraction technique for classification tasks compared to PCA. However, A. M. Martinez
reported that preprocessing via PCA tends to result in better classification results in an image
recognition task in certain cases, for instance, if each class consists of only a small number of
samples (A. M. Martinez and A. C. Kak. PCA Versus LDA. Pattern Analysis and Machine
Intelligence, IEEE Transactions on, 23(2):228-233, 2001).


Note 


Although LDA is sometimes also called Fisher's LDA, Ronald A. Fisher initially formulated Fisher's 
Linear Discriminant for two-class classification problems in 1936 (R. A. Fisher. The Use of 
Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 7(2):179—188, 1936). Fisher's 
Linear Discriminant was later generalized for multi-class problems by C. Radhakrishna Rao under 
the assumption of equal class covariances and normally distributed classes in 1948, which we now 
call LDA (C. R. Rao. The Utilization of Multiple Measurements in Problems of Biological
Classification. Journal of the Royal Statistical Society. Series B (Methodological), 10(2):159-203, 
1948). 


The following figure summarizes the concept of LDA for a two-class problem. Samples from class 1
are shown as crosses and samples from class 2 are shown as circles:

A linear discriminant, as shown on the x-axis (LD 1), would separate the two normally distributed
classes well. Although the exemplary linear discriminant shown on the y-axis (LD 2) captures a lot of
the variance in the dataset, it would fail as a good linear discriminant since it does not capture any
of the class-discriminatory information.


One assumption in LDA is that the data is normally distributed. Also, we assume that the classes have
identical covariance matrices and that the features are statistically independent of each other.
However, even if one or more of those assumptions are slightly violated, LDA for dimensionality 
reduction can still work reasonably well (R. O. Duda, P. E. Hart, and D. G. Stork. Pattern 
Classification. 2nd. Edition. New York, 2001). 


Before we take a look into the inner workings of LDA in the following subsections, let's summarize 
the key steps of the LDA approach: 


1. Standardize the $d$-dimensional dataset ($d$ is the number of features).
2. For each class, compute the $d$-dimensional mean vector.
3. Construct the between-class scatter matrix $S_B$ and the within-class scatter matrix $S_W$.
4. Compute the eigenvectors and corresponding eigenvalues of the matrix $S_W^{-1} S_B$.
5. Choose the $k$ eigenvectors that correspond to the $k$ largest eigenvalues to construct a
   $d \times k$-dimensional transformation matrix $W$; the eigenvectors are the columns of this matrix.
6. Project the samples onto the new feature subspace using the transformation matrix $W$.
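
The six steps above can be sketched in NumPy as follows. The three-class synthetic dataset and its
class means are hypothetical, and the within-class scatter matrix is left unscaled here for brevity;
this is a minimal illustration under those assumptions rather than a complete implementation.

```python
import numpy as np

rng = np.random.RandomState(0)
# Hypothetical data: 3 classes, 30 samples each, 4 features
means = np.array([[0., 0., 0., 0.], [3., 0., 1., 0.], [0., 3., 0., 1.]])
X = np.vstack([rng.normal(loc=m, size=(30, 4)) for m in means])
y = np.repeat([1, 2, 3], 30)
d, k = X.shape[1], 2

# 1. Standardize
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
# 2. Class mean vectors
labels = np.unique(y)
mean_vecs = [X_std[y == c].mean(axis=0) for c in labels]
# 3. Within-class (S_W) and between-class (S_B) scatter matrices
mean_overall = X_std.mean(axis=0)
S_W, S_B = np.zeros((d, d)), np.zeros((d, d))
for c, mv in zip(labels, mean_vecs):
    X_c = X_std[y == c]
    S_W += (X_c - mv).T.dot(X_c - mv)
    diff = (mv - mean_overall).reshape(d, 1)
    S_B += X_c.shape[0] * diff.dot(diff.T)
# 4. Eigenpairs of inv(S_W) S_B
eig_vals, eig_vecs = np.linalg.eig(np.linalg.inv(S_W).dot(S_B))
# 5. Transformation matrix W (d x k) from the k largest eigenvalues
order = np.argsort(np.abs(eig_vals))[::-1][:k]
W = eig_vecs[:, order].real
# 6. Project the samples onto the new subspace
X_lda = X_std.dot(W)
print(X_lda.shape)
```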
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Note 


The assumptions that we make when we are using LDA are that the features are normally distributed 
and independent of each other. Also, the LDA algorithm assumes that the covariance matrices for the 
individual classes are identical. However, even if we violate those assumptions to a certain extent, 
LDA may still work reasonably well in dimensionality reduction and classification tasks (R. O. Duda, 
P. E. Hart, and D. G. Stork. Pattern Classification. 2nd. Edition. New York, 2001). 
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Computing the scatter matrices 


Since we have already standardized the features of the Wine dataset in the PCA section at the
beginning of this chapter, we can skip the first step and proceed with the calculation of the mean
vectors, which we will use to construct the within-class scatter matrix and between-class scatter
matrix, respectively. Each mean vector $m_i$ stores the mean feature value $\mu_m$ with respect to the
samples of class $i$, where $D_i$ is the set of those samples and $N_i$ is their number:


$$m_i = \frac{1}{N_i} \sum_{x \in D_i} x$$


This results in three mean vectors:


$$m_i = \begin{bmatrix} \mu_{i,\text{alcohol}} \\ \mu_{i,\text{malic acid}} \\ \vdots \\ \mu_{i,\text{proline}} \end{bmatrix}, \quad i \in \{1, 2, 3\}$$

>>> np.set_printoptions(precision=4)
>>> mean_vecs = []
>>> for label in range(1,4):
...     mean_vecs.append(np.mean(
...                X_train_std[y_train==label], axis=0))
...     print('MV %s: %s\n' %(label, mean_vecs[label-1]))
MV 1: [ 0.9259 -0.3091  0.2592 -0.7989  0.3039  0.9608  1.0515 -0.6306  0.5354
  0.2209  0.4855  0.798   1.2017]

MV 2: [-0.8727 -0.3854 -0.4437  0.2481 -0.2409 -0.1059  0.0187 -0.0164  0.1095
 -0.8796  0.4392  0.2776 -0.7016]

MV 3: [ 0.1637  0.8929  0.3249  0.5658 -0.01   -0.9499 -1.228   0.7436 -0.7652
  0.979  -1.1698 -1.3007 -0.3912]


Using the mean vectors, we can now compute the within-class scatter matrix $S_W$:


$$S_W = \sum_{i=1}^{c} S_i$$


This is calculated by summing up the individual scatter matrices $S_i$ of each individual class $i$:


$$S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^T$$


>>> d = 13 # number of features
>>> S_W = np.zeros((d, d))
>>> for label, mv in zip(range(1, 4), mean_vecs):
...     class_scatter = np.zeros((d, d))
...     # iterate over the standardized training samples of this class
...     for row in X_train_std[y_train == label]:
...         row, mv = row.reshape(d, 1), mv.reshape(d, 1)
...         class_scatter += (row-mv).dot((row-mv).T)
...     S_W += class_scatter
>>> print('Within-class scatter matrix: %sx%s'
...       % (S_W.shape[0], S_W.shape[1]))
Within-class scatter matrix: 13x13


The assumption that we are making when we are computing the scatter matrices is that the class labels
in the training set are uniformly distributed. However, if we print the number of class labels, we see
that this assumption is violated:


>>> print('Class label distribution: %s'
...       % np.bincount(y_train)[1:])
Class label distribution: [40 49 35]


Thus, we want to scale the individual scatter matrices $S_i$ before we sum them up as scatter matrix
$S_W$. When we divide the scatter matrices by the number of class samples $N_i$, we can see that
computing the scatter matrix is in fact the same as computing the covariance matrix $\Sigma_i$. The
covariance matrix is a normalized version of the scatter matrix:


$$\Sigma_i = \frac{1}{N_i} S_i = \frac{1}{N_i} \sum_{x \in D_i} (x - m_i)(x - m_i)^T$$


>>> d = 13 # number of features
>>> S_W = np.zeros((d, d))
>>> for label, mv in zip(range(1, 4), mean_vecs):
...     class_scatter = np.cov(X_train_std[y_train==label].T)
...     S_W += class_scatter
>>> print('Scaled within-class scatter matrix: %sx%s'
...       % (S_W.shape[0], S_W.shape[1]))
Scaled within-class scatter matrix: 13x13
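
To see why the scaling matters, the following sketch draws two hypothetical classes from the same
distribution but with very different sample counts, and compares their unscaled scatter matrices with
their covariance matrices. Note that numpy.cov divides by N_i - 1 (the unbiased estimator) rather than
by N_i, which makes little practical difference for classes of this size.

```python
import numpy as np

rng = np.random.RandomState(0)
# Two hypothetical classes drawn from the same distribution, but of unequal size
X_big = rng.normal(size=(200, 3))
X_small = rng.normal(size=(20, 3))

def scatter(X_c):
    """Unscaled scatter matrix: sum of (x - m)(x - m)^T over the class samples."""
    centered = X_c - X_c.mean(axis=0)
    return centered.T.dot(centered)

S_big, S_small = scatter(X_big), scatter(X_small)

# Unscaled: the larger class contributes roughly ten times as much
ratio_unscaled = np.trace(S_big) / np.trace(S_small)

# Scaled: np.cov divides each scatter matrix by (N_i - 1), so the two
# contributions become comparable
ratio_scaled = np.trace(np.cov(X_big.T)) / np.trace(np.cov(X_small.T))

print(ratio_unscaled, ratio_scaled)
print(np.allclose(np.cov(X_small.T), S_small / (X_small.shape[0] - 1)))
```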


After we have computed the scaled within-class scatter matrix (or covariance matrix), we can move
on to the next step and compute the between-class scatter matrix $S_B$:


$$S_B = \sum_{i=1}^{c} N_i (m_i - m)(m_i - m)^T$$


Here, $m$ is the overall mean that is computed, including samples from all classes.


>>> mean_overall = np.mean(X_train_std, axis=0)
>>> d = 13 # number of features
>>> S_B = np.zeros((d, d))
>>> for i, mean_vec in enumerate(mean_vecs):
...     # number of standardized training samples in class i+1
...     n = X_train_std[y_train==i+1, :].shape[0]
...     mean_vec = mean_vec.reshape(d, 1)
...     mean_overall = mean_overall.reshape(d, 1)
...     S_B += n * (mean_vec - mean_overall).dot(
...         (mean_vec - mean_overall).T)
>>> print('Between-class scatter matrix: %sx%s'
...       % (S_B.shape[0], S_B.shape[1]))
Between-class scatter matrix: 13x13
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Selecting linear discriminants for the new feature 
subspace 


The remaining steps of the LDA are similar to the steps of the PCA. However, instead of performing
the eigendecomposition on the covariance matrix, we solve the generalized eigenvalue problem of the
matrix $S_W^{-1} S_B$:

>>> eigen_vals, eigen_vecs = \
...     np.linalg.eig(np.linalg.inv(S_W).dot(S_B))


After we computed the eigenpairs, we can now sort the eigenvalues in descending order: 


>>> eigen_pairs = [(np.abs(eigen_vals[i]), eigen_vecs[:,i])
...                for i in range(len(eigen_vals))]
>>> eigen_pairs = sorted(eigen_pairs,
...                      key=lambda k: k[0], reverse=True)
>>> print('Eigenvalues in decreasing order:\n')
>>> for eigen_val in eigen_pairs:
...     print(eigen_val[0])


Eigenvalues in decreasing order: 


643.015384346
225.086981854
[11 further eigenvalues, each of magnitude 1e-13 or smaller]


In LDA, the number of linear discriminants is at most $c - 1$, where $c$ is the number of class
labels, because the between-class scatter matrix $S_B$ is the sum of $c$ matrices with rank one or
less. We can indeed see that we only have two nonzero eigenvalues (the eigenvalues 3-13 are not
exactly zero, but this is due to the floating point arithmetic in NumPy). Note that in the rare case
of perfect collinearity (all aligned sample points fall on a straight line), the covariance matrix
would have rank one, which would result in only one eigenvector with a nonzero eigenvalue.
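The bound on the number of nonzero eigenvalues can be illustrated on a small synthetic problem. This is a sketch under stated assumptions: the three Gaussian classes, their mean vectors, and the relative threshold of 1e-8 are all made up for the illustration and are not taken from the Wine example.

```python
import numpy as np

rng = np.random.default_rng(1)
d, c, n_per_class = 5, 3, 30

# Hypothetical data: three Gaussian classes with different mean vectors
means = rng.normal(scale=3.0, size=(c, d))
X = np.vstack([rng.normal(loc=m, size=(n_per_class, d)) for m in means])
y = np.repeat(np.arange(c), n_per_class)

overall_mean = X.mean(axis=0)
S_W = np.zeros((d, d))
S_B = np.zeros((d, d))
for label in range(c):
    X_l = X[y == label]
    m_l = X_l.mean(axis=0)
    S_W += np.cov(X_l.T)                          # scaled within-class scatter
    diff = (m_l - overall_mean).reshape(d, 1)
    S_B += X_l.shape[0] * (diff @ diff.T)         # rank-one term per class

# Eigenvalues of S_W^{-1} S_B, sorted by decreasing magnitude
eigen_vals = np.linalg.eigvals(np.linalg.inv(S_W) @ S_B)
magnitudes = np.sort(np.abs(eigen_vals))[::-1]

# Count eigenvalues that are not numerically zero relative to the largest
n_nonzero = int(np.sum(magnitudes > 1e-8 * magnitudes[0]))
print(n_nonzero, 'nonzero eigenvalues for', c, 'classes')
```

With three classes, only two eigenvalues should stand out from floating point noise, regardless of the number of features.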


To measure how much of the class-discriminatory information is captured by the linear discriminants
(eigenvectors), let's plot the linear discriminants by decreasing eigenvalues similar to the explained
variance plot that we created in the PCA section. For simplicity, we will call the content of the
class-discriminatory information discriminability.


>>> tot = sum(eigen_vals.real)
>>> discr = [(i / tot) for i in sorted(eigen_vals.real, reverse=True)]
>>> cum_discr = np.cumsum(discr)
>>> plt.bar(range(1, 14), discr, alpha=0.5, align='center',
...         label='individual "discriminability"')
>>> plt.step(range(1, 14), cum_discr, where='mid',
...          label='cumulative "discriminability"')
>>> plt.ylabel('"discriminability" ratio')
>>> plt.xlabel('Linear Discriminants')
>>> plt.ylim([-0.1, 1.1])
>>> plt.legend(loc='best')
>>> plt.show()


As we can see in the resulting figure, the first two linear discriminants capture about 100 percent of
the useful information in the Wine training dataset:

[Figure: individual and cumulative "discriminability" ratio (y-axis) for each of the 13 linear
discriminants (x-axis)]


Let's now stack the two most discriminative eigenvector columns to create the transformation matrix
$W$:

>>> w = np.hstack((eigen_pairs[0][1][:, np.newaxis].real,
...                eigen_pairs[1][1][:, np.newaxis].real))
>>> print('Matrix W:\n', w)
Matrix W:
[output: 13 x 2 array holding the two leading eigenvectors as columns]


Projecting samples onto the new feature space


Using the transformation matrix $W$ that we created in the previous subsection, we can now transform
the training data set by multiplying the matrices:

$$X' = XW$$

>>> X_train_lda = X_train_std.dot(w)
>>> colors = ['r', 'b', 'g']
>>> markers = ['s', 'x', 'o']
>>> for l, c, m in zip(np.unique(y_train), colors, markers):
...     plt.scatter(X_train_lda[y_train==l, 0],
...                 X_train_lda[y_train==l, 1],
...                 c=c, label=l, marker=m)
>>> plt.xlabel('LD 1')
>>> plt.ylabel('LD 2')
>>> plt.legend(loc='upper right')
>>> plt.show()


As we can see in the resulting plot, the three wine classes are now linearly separable in the new 
feature subspace: 


[Figure: Wine training samples plotted on LD 1 (x-axis) versus LD 2 (y-axis)]


LDA via scikit-learn


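The step-by-step implementation was a good exercise for understanding the inner workings of LDA
and understanding the differences between LDA and PCA. Now, let's take a look at the LDA class
implemented in scikit-learn. In current scikit-learn releases, this class is named
LinearDiscriminantAnalysis and lives in sklearn.discriminant_analysis; the sklearn.lda module
used by older releases has been removed:


>>> from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
>>> lda = LDA(n_components=2)
>>> X_train_lda = lda.fit_transform(X_train_std, y_train)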


Next, let's see how the logistic regression classifier handles the lower-dimensional training dataset
after the LDA transformation:


>>> lr = LogisticRegression()
>>> lr = lr.fit(X_train_lda, y_train)
>>> plot_decision_regions(X_train_lda, y_train, classifier=lr)
>>> plt.xlabel('LD 1')
>>> plt.ylabel('LD 2')
>>> plt.legend(loc='lower left')
>>> plt.show()


Looking at the resulting plot, we see that the logistic regression model misclassifies one of the 
samples from class 2: 





By lowering the regularization strength, we could probably shift the decision boundaries so that the
logistic regression models classify all samples in the training dataset correctly. However, let's take a
look at the results on the test set:


>>> X_test_lda = lda.transform(X_test_std)
>>> plot_decision_regions(X_test_lda, y_test, classifier=lr)
>>> plt.xlabel('LD 1')
>>> plt.ylabel('LD 2')
>>> plt.legend(loc='lower left')
>>> plt.show()


As we can see in the resulting plot, the logistic regression classifier is able to get a perfect accuracy
score for classifying the samples in the test dataset by only using a two-dimensional feature subspace
instead of the original 13 Wine features.


Using kernel principal component analysis for nonlinear mappings


Many machine learning algorithms make assumptions about the linear separability of the input data.
You learned that the perceptron even requires perfectly linearly separable training data to converge.
Other algorithms that we have covered so far assume that the lack of perfect linear separability is due
to noise: Adaline, logistic regression, and the (standard) support vector machine (SVM) to just
name a few. However, if we are dealing with nonlinear problems, which we may encounter rather
frequently in real-world applications, linear transformation techniques for dimensionality reduction,
such as PCA and LDA, may not be the best choice. In this section, we will take a look at a kernelized
version of PCA, or kernel PCA, which relates to the concepts of kernel SVM that we remember from
Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn. Using kernel PCA, we will
learn how to transform data that is not linearly separable onto a new, lower-dimensional subspace
that is suitable for linear classifiers.


[Figure: linear vs. nonlinear problems]


Kernel functions and the kernel trick


As we remember from our discussion about kernel SVMs in Chapter 3, A Tour of Machine Learning
Classifiers Using Scikit-learn, we can tackle nonlinear problems by projecting them onto a new
feature space of higher dimensionality where the classes become linearly separable. To transform the
samples $x \in \mathbb{R}^d$ onto this higher k-dimensional subspace, we defined a nonlinear mapping
function $\phi$:

$$\phi : \mathbb{R}^d \to \mathbb{R}^k \quad (k \gg d)$$

We can think of $\phi$ as a function that creates nonlinear combinations of the original features to
map the original d-dimensional dataset onto a larger, k-dimensional feature space. For example, if we
had a feature vector $x \in \mathbb{R}^d$ ($x$ is a column vector consisting of $d$ features) with two
dimensions ($d = 2$), a potential mapping onto a 3D space could be as follows:

$$x = [x_1, x_2]^T$$

$$\downarrow \phi$$

$$z = [x_1^2, \sqrt{2}\, x_1 x_2, x_2^2]^T$$

In other words, via kernel PCA we perform a nonlinear mapping that transforms the data onto a
higher-dimensional space and use standard PCA in this higher-dimensional space to project the data
back onto a lower-dimensional space where the samples can be separated by a linear classifier
(under the condition that the samples can be separated by density in the input space). However, one
downside of this approach is that it is computationally very expensive, and this is where we use the
kernel trick. Using the kernel trick, we can compute the similarity between two high-dimensional
feature vectors in the original feature space.
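To make the kernel trick concrete, here is a minimal sketch for one specific mapping, the 2D to 3D map $[x_1^2, \sqrt{2}\,x_1 x_2, x_2^2]$. The function names phi and poly2_kernel and the two sample vectors are assumptions introduced for this illustration. The dot product computed explicitly in the 3D feature space matches the squared dot product computed directly in the 2D input space.

```python
import numpy as np

def phi(x):
    """Explicit 2D -> 3D feature map: [x1^2, sqrt(2)*x1*x2, x2^2]."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def poly2_kernel(x, z):
    """Degree-2 homogeneous polynomial kernel, evaluated in the input space."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = float(np.dot(phi(x), phi(z)))   # dot product in the 3D feature space
implicit = poly2_kernel(x, z)              # same quantity without computing phi

print(explicit, implicit)
```

For higher-degree or infinite-dimensional mappings such as the one behind the RBF kernel, only the implicit route remains practical.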


Before we proceed with more details about using the kernel trick to tackle this computationally
expensive problem, let's look back at the standard PCA approach that we implemented at the
beginning of this chapter. We computed the covariance between two features $k$ and $j$ as follows:

$$\sigma_{jk} = \frac{1}{n} \sum_{i=1}^{n} \left( x_j^{(i)} - \mu_j \right) \left( x_k^{(i)} - \mu_k \right)$$

Since the standardizing of features centers them at mean zero, for instance, $\mu_j = 0$ and
$\mu_k = 0$, we can simplify this equation as follows:

$$\sigma_{jk} = \frac{1}{n} \sum_{i=1}^{n} x_j^{(i)} x_k^{(i)}$$

Note that the preceding equation refers to the covariance between two features; now, let's write the
general equation to calculate the covariance matrix $\Sigma$:

$$\Sigma = \frac{1}{n} \sum_{i=1}^{n} x^{(i)} x^{(i)T}$$

Bernhard Schölkopf generalized this approach (B. Schölkopf, A. Smola, and K.-R. Müller. Kernel
Principal Component Analysis. pages 583-588, 1997) so that we can replace the dot products
between samples in the original feature space by the nonlinear feature combinations via $\phi$:

$$\Sigma = \frac{1}{n} \sum_{i=1}^{n} \phi(x^{(i)}) \phi(x^{(i)})^T$$

To obtain the eigenvectors (the principal components) from this covariance matrix, we have to
solve the following equation:

$$\Sigma v = \lambda v$$

$$\Rightarrow \frac{1}{n} \sum_{i=1}^{n} \phi(x^{(i)}) \phi(x^{(i)})^T v = \lambda v$$

$$\Rightarrow v = \frac{1}{n \lambda} \sum_{i=1}^{n} \phi(x^{(i)}) \phi(x^{(i)})^T v = \frac{1}{n} \sum_{i=1}^{n} a^{(i)} \phi(x^{(i)})$$

Here, $\lambda$ and $v$ are the eigenvalues and eigenvectors of the covariance matrix $\Sigma$, and
$a$ can be obtained by extracting the eigenvectors of the kernel (similarity) matrix $K$ as we will
see in the following paragraphs.


The derivation of the kernel matrix is as follows:

First, let's write the covariance matrix in matrix notation, where $\phi(X)$ is an
$n \times k$-dimensional matrix:

$$\Sigma = \frac{1}{n} \sum_{i=1}^{n} \phi(x^{(i)}) \phi(x^{(i)})^T = \frac{1}{n} \phi(X)^T \phi(X)$$

Now, we can write the eigenvector equation as follows:

$$v = \frac{1}{n} \sum_{i=1}^{n} a^{(i)} \phi(x^{(i)}) = \lambda \phi(X)^T a$$

Since $\Sigma v = \lambda v$, we get:

$$\frac{1}{n} \phi(X)^T \phi(X) \phi(X)^T a = \lambda \phi(X)^T a$$

Multiplying it by $\phi(X)$ on both sides yields the following result:

$$\frac{1}{n} \phi(X) \phi(X)^T \phi(X) \phi(X)^T a = \lambda \phi(X) \phi(X)^T a$$

$$\Rightarrow \frac{1}{n} \phi(X) \phi(X)^T a = \lambda a$$

$$\Rightarrow \frac{1}{n} K a = \lambda a$$

Here, $K$ is the similarity (kernel) matrix:

$$K = \phi(X) \phi(X)^T$$


As we recall from the SVM section in Chapter 3, A Tour of Machine Learning Classifiers Using
Scikit-learn, we use the kernel trick to avoid calculating the pairwise dot products of the samples
$x$ under $\phi$ explicitly by using a kernel function $\kappa$ so that we don't need to calculate
the eigenvectors explicitly:

$$\kappa(x^{(i)}, x^{(j)}) = \phi(x^{(i)})^T \phi(x^{(j)})$$


In other words, what we obtain after kernel PCA are the samples already projected onto the 
respective components rather than constructing a transformation matrix as in the standard PCA 
approach. Basically, the kernel function (or simply kernel) can be understood as a function that 
calculates a dot product between two vectors—a measure of similarity. 


The most commonly used kernels are as follows:

• The polynomial kernel:

  $$\kappa(x^{(i)}, x^{(j)}) = \left( x^{(i)T} x^{(j)} + \theta \right)^p$$

  Here, $\theta$ is the threshold and $p$ is the power that has to be specified by the user.

• The hyperbolic tangent (sigmoid) kernel:

  $$\kappa(x^{(i)}, x^{(j)}) = \tanh\left( \eta\, x^{(i)T} x^{(j)} + \theta \right)$$

• The Radial Basis Function (RBF) or Gaussian kernel that we will use in the following
  examples in the next subsection:

  $$\kappa(x^{(i)}, x^{(j)}) = \exp\left( -\frac{\lVert x^{(i)} - x^{(j)} \rVert^2}{2 \sigma^2} \right)$$

  It is also written as follows:

  $$\kappa(x^{(i)}, x^{(j)}) = \exp\left( -\gamma \lVert x^{(i)} - x^{(j)} \rVert^2 \right)$$
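The three kernels listed above can be written as short NumPy functions. This is an illustrative sketch only: the function names, the default parameter values, and the two sample vectors are assumptions chosen for the example, not settings recommended by the text.

```python
import numpy as np

def polynomial_kernel(x, z, theta=1.0, p=2):
    """(x^T z + theta)^p with threshold theta and power p."""
    return (np.dot(x, z) + theta) ** p

def sigmoid_kernel(x, z, eta=0.1, theta=0.0):
    """tanh(eta * x^T z + theta)."""
    return np.tanh(eta * np.dot(x, z) + theta)

def rbf_kernel(x, z, gamma=1.0):
    """exp(-gamma * ||x - z||^2); gamma corresponds to 1 / (2 * sigma^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([2.0, 0.0])

poly_val = polynomial_kernel(x, z)
sig_val = sigmoid_kernel(x, z)
rbf_val = rbf_kernel(x, z)
rbf_self = rbf_kernel(x, x)   # identical inputs have zero distance

print(poly_val, sig_val, rbf_val, rbf_self)
```

The RBF kernel always lies between 0 and 1 and equals 1 only for identical inputs, which is why it is often read as a similarity score.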


To summarize what we have discussed so far, we can define the following three steps to implement an 
RBF kernel PCA: 


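1. We compute the kernel (similarity) matrix $K$ where we need to calculate the following:

   $$\kappa(x^{(i)}, x^{(j)}) = \exp\left( -\gamma \lVert x^{(i)} - x^{(j)} \rVert^2 \right)$$

   We do this for each pair of samples:

   $$K = \begin{bmatrix}
   \kappa(x^{(1)}, x^{(1)}) & \kappa(x^{(1)}, x^{(2)}) & \cdots & \kappa(x^{(1)}, x^{(n)}) \\
   \kappa(x^{(2)}, x^{(1)}) & \kappa(x^{(2)}, x^{(2)}) & \cdots & \kappa(x^{(2)}, x^{(n)}) \\
   \vdots & \vdots & \ddots & \vdots \\
   \kappa(x^{(n)}, x^{(1)}) & \kappa(x^{(n)}, x^{(2)}) & \cdots & \kappa(x^{(n)}, x^{(n)})
   \end{bmatrix}$$

   For example, if our dataset contains 100 training samples, the symmetric kernel matrix of the
   pair-wise similarities would be 100x100 dimensional.


2. We center the kernel matrix $K$ using the following equation:

   $$K' = K - 1_n K - K 1_n + 1_n K 1_n$$

   Here, $1_n$ is an $n \times n$-dimensional matrix (the same dimensions as the kernel matrix)
   where all values are equal to $\frac{1}{n}$.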


3. We collect the top k eigenvectors of the centered kernel matrix based on their corresponding 
eigenvalues, which are ranked by decreasing magnitude. In contrast to standard PCA, the 
eigenvectors are not the principal component axes but the samples projected onto those axes. 
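The statement in step 3, that the eigenvectors of the kernel matrix are already the projected samples, can be checked in the one case where both routes are easy to compute: a linear kernel. The sketch below uses made-up data and assumes a linear kernel instead of the RBF kernel. It compares the projection onto the first principal axis from standard PCA with the top eigenvector of the kernel matrix, which agree up to sign once the unit-norm eigenvector is scaled by the square root of its eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data with clearly different variances per direction
X = rng.normal(size=(20, 3)) @ np.diag([3.0, 1.0, 0.3])
X_centered = X - X.mean(axis=0)

# Standard PCA via SVD: project the samples onto the first principal axis
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
pca_proj = X_centered @ Vt[0]

# "Kernel PCA" with a linear kernel; the Gram matrix of centered
# features is already a centered kernel matrix
K = X_centered @ X_centered.T
eigvals, eigvecs = np.linalg.eigh(K)   # eigenvalues in ascending order
a = eigvecs[:, -1]                     # unit-norm top eigenvector
lam = eigvals[-1]

kpca_proj = np.sqrt(lam) * a
same_up_to_sign = np.allclose(np.abs(pca_proj), np.abs(kpca_proj))
print(same_up_to_sign)
```

No transformation matrix appears on the kernel side: the eigenvector has one entry per sample, not one entry per feature.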


At this point, you may be wondering why we need to center the kernel matrix in the second step. We
previously assumed that we are working with standardized data, where all features have mean zero
when we formulated the covariance matrix and replaced the dot products by the nonlinear feature
combinations via $\phi$. Thus, the centering of the kernel matrix in the second step becomes
necessary, since we do not compute the new feature space explicitly and we cannot guarantee that the
new feature space is also centered at zero.
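The effect of the centering formula can be verified in a setting where the feature space is known explicitly. The following sketch assumes a linear kernel on deliberately uncentered, made-up data; in that case, centering the kernel matrix must give the same result as centering the features first and then building the kernel matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(loc=5.0, size=(6, 2))    # hypothetical data, far from mean zero

K = X @ X.T                             # linear kernel on the raw data
n = K.shape[0]
one_n = np.ones((n, n)) / n             # n x n matrix with all entries 1/n
K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n

# Reference: center the features explicitly, then build the kernel matrix
X_c = X - X.mean(axis=0)
K_reference = X_c @ X_c.T

print(np.allclose(K_centered, K_reference))
```

For the RBF kernel the explicit route is not available, so the formula applied to $K$ is the only way to obtain the centered matrix.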


In the next section, we will put those three steps into action by implementing a kernel PCA in Python.


Implementing a kernel principal component analysis in Python


In the previous subsection, we discussed the core concepts behind kernel PCA. Now, we are going to 
implement an RBF kernel PCA in Python following the three steps that summarized the kernel PCA 
approach. Using the SciPy and NumPy helper functions, we will see that implementing a kernel PCA 
is actually really simple: 


from scipy.spatial.distance import pdist, squareform
from scipy.linalg import eigh
import numpy as np


def rbf_kernel_pca(X, gamma, n_components):
    """
    RBF kernel PCA implementation.

    Parameters
    ------------
    X: {NumPy ndarray}, shape = [n_samples, n_features]

    gamma: float
      Tuning parameter of the RBF kernel

    n_components: int
      Number of principal components to return

    Returns
    ------------
    X_pc: {NumPy ndarray}, shape = [n_samples, k_features]
      Projected dataset

    """
    # Calculate pairwise squared Euclidean distances
    # in the MxN dimensional dataset.
    sq_dists = pdist(X, 'sqeuclidean')

    # Convert pairwise distances into a square matrix.
    mat_sq_dists = squareform(sq_dists)

    # Compute the symmetric kernel matrix.
    # np.exp is used in place of the deprecated scipy.exp alias.
    K = np.exp(-gamma * mat_sq_dists)

    # Center the kernel matrix.
    N = K.shape[0]
    one_n = np.ones((N, N)) / N
    K = K - one_n.dot(K) - K.dot(one_n) + one_n.dot(K).dot(one_n)

    # Obtaining eigenpairs from the centered kernel matrix;
    # scipy.linalg.eigh returns them in ascending eigenvalue order.
    eigvals, eigvecs = eigh(K)

    # Collect the top k eigenvectors (projected samples).
    # A list is passed because np.column_stack expects a sequence.
    X_pc = np.column_stack([eigvecs[:, -i]
                            for i in range(1, n_components + 1)])

    return X_pc


One downside of using an RBF kernel PCA for dimensionality reduction is that we have to specify the
parameter $\gamma$ a priori. Finding an appropriate value for $\gamma$ requires experimentation and
is best done using algorithms for parameter tuning, for example, grid search, which we will discuss in
more detail in Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning.


Example 1 — separating half-moon shapes 


Now, let's apply our rbf_kernel_pca on some nonlinear example datasets. We will start by creating
a two-dimensional dataset of 100 sample points representing two half-moon shapes:


>>> from sklearn.datasets import make_moons
>>> X, y = make_moons(n_samples=100, random_state=123)
>>> plt.scatter(X[y==0, 0], X[y==0, 1],
...             color='red', marker='^', alpha=0.5)
>>> plt.scatter(X[y==1, 0], X[y==1, 1],
...             color='blue', marker='o', alpha=0.5)
>>> plt.show()


For the purposes of illustration, the half-moon of triangular symbols shall represent one class and the 
half-moon depicted by the circular symbols represent the samples from another class: 


[Figure: scatter plot of the two half-moon shapes (triangles and circles)]


Clearly, these two half-moon shapes are not linearly separable and our goal is to unfold the
half-moons via kernel PCA so that the dataset can serve as a suitable input for a linear classifier.
But first, let's see what the dataset looks like if we project it onto the principal components via
standard PCA:


>>> from sklearn.decomposition import PCA
>>> scikit_pca = PCA(n_components=2)
>>> X_spca = scikit_pca.fit_transform(X)
>>> fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(7,3))
>>> ax[0].scatter(X_spca[y==0, 0], X_spca[y==0, 1],
...               color='red', marker='^', alpha=0.5)
>>> ax[0].scatter(X_spca[y==1, 0], X_spca[y==1, 1],
...               color='blue', marker='o', alpha=0.5)
>>> ax[1].scatter(X_spca[y==0, 0], np.zeros((50,1))+0.02,
...               color='red', marker='^', alpha=0.5)
>>> ax[1].scatter(X_spca[y==1, 0], np.zeros((50,1))-0.02,
...               color='blue', marker='o', alpha=0.5)
>>> ax[0].set_xlabel('PC1')
>>> ax[0].set_ylabel('PC2')
>>> ax[1].set_ylim([-1, 1])
>>> ax[1].set_yticks([])
>>> ax[1].set_xlabel('PC1')
>>> plt.show()


Clearly, we can see in the resulting figure that a linear classifier would be unable to perform well on
the dataset transformed via standard PCA:

[Figure: half-moon data after standard PCA; left panel PC1 versus PC2, right panel PC1 only]

Note that when we plotted the first principal component only (right subplot), we shifted the triangular 


samples slightly upwards and the circular samples slightly downwards to better visualize the class 
overlap. 


Note 


Please remember that PCA is an unsupervised method and does not use class label information in
order to maximize the variance in contrast to LDA. Here, the triangular and circular symbols were
just added for visualization purposes to indicate the degree of separation.


Now, let's try out our kernel PCA function rbf_kernel_pca, which we implemented in the previous
subsection:


>>> from matplotlib.ticker import FormatStrFormatter
>>> X_kpca = rbf_kernel_pca(X, gamma=15, n_components=2)
>>> fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(7,3))
>>> ax[0].scatter(X_kpca[y==0, 0], X_kpca[y==0, 1],
...               color='red', marker='^', alpha=0.5)
>>> ax[0].scatter(X_kpca[y==1, 0], X_kpca[y==1, 1],
...               color='blue', marker='o', alpha=0.5)
>>> ax[1].scatter(X_kpca[y==0, 0], np.zeros((50,1))+0.02,
...               color='red', marker='^', alpha=0.5)
>>> ax[1].scatter(X_kpca[y==1, 0], np.zeros((50,1))-0.02,
...               color='blue', marker='o', alpha=0.5)
>>> ax[0].set_xlabel('PC1')
>>> ax[0].set_ylabel('PC2')
>>> ax[1].set_ylim([-1, 1])
>>> ax[1].set_yticks([])
>>> ax[1].set_xlabel('PC1')
>>> ax[0].xaxis.set_major_formatter(FormatStrFormatter('%0.1f'))
>>> ax[1].xaxis.set_major_formatter(FormatStrFormatter('%0.1f'))
>>> plt.show()


We can now see that the two classes (circles and triangles) are linearly well separated so that it 
becomes a suitable training dataset for linear classifiers: 


[Figure: half-moon data after RBF kernel PCA; left panel PC1 versus PC2, right panel PC1 only]


Unfortunately, there is no universal value for the tuning parameter $\gamma$ that works well for
different datasets. To find a $\gamma$ value that is appropriate for a given problem requires
experimentation. In Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter
Tuning, we will discuss techniques that can help us to automate the task of optimizing such tuning
parameters. Here, I will use values for $\gamma$ that I found produce good results.
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Example 2 — separating concentric circles 


In the previous subsection, we showed you how to separate half-moon shapes via kernel PCA. Since 
we put so much effort into understanding the concepts of kernel PCA, let's take a look at another 
interesting example of a nonlinear problem: concentric circles. 


The code is as follows:


>>> from sklearn.datasets import make_circles
>>> X, y = make_circles(n_samples=1000,
...                     random_state=123, noise=0.1, factor=0.2)
>>> plt.scatter(X[y==0, 0], X[y==0, 1],
...             color='red', marker='^', alpha=0.5)
>>> plt.scatter(X[y==1, 0], X[y==1, 1],
...             color='blue', marker='o', alpha=0.5)
>>> plt.show()


Again, we assume a two-class problem where the triangle shapes represent one class and the circle 
shapes represent another class, respectively: 


[Figure: scatter plot of the concentric circles dataset (triangles and circles)]


Let's start with the standard PCA approach to compare it with the results of the RBF kernel PCA:


>>> scikit_pca = PCA(n_components=2)
>>> X_spca = scikit_pca.fit_transform(X)
>>> fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(7,3))
>>> ax[0].scatter(X_spca[y==0, 0], X_spca[y==0, 1],
...               color='red', marker='^', alpha=0.5)
>>> ax[0].scatter(X_spca[y==1, 0], X_spca[y==1, 1],
...               color='blue', marker='o', alpha=0.5)
>>> ax[1].scatter(X_spca[y==0, 0], np.zeros((500,1))+0.02,
...               color='red', marker='^', alpha=0.5)
>>> ax[1].scatter(X_spca[y==1, 0], np.zeros((500,1))-0.02,
...               color='blue', marker='o', alpha=0.5)
>>> ax[0].set_xlabel('PC1')
>>> ax[0].set_ylabel('PC2')
>>> ax[1].set_ylim([-1, 1])
>>> ax[1].set_yticks([])
>>> ax[1].set_xlabel('PC1')
>>> plt.show()


Again, we can see that standard PCA is not able to produce results suitable for training a linear
classifier:

[Figure: concentric circles data after standard PCA; left panel PC1 versus PC2, right panel PC1 only]


Given an appropriate value for $\gamma$, let's see if we are luckier using the RBF kernel PCA
implementation:


>>> X_kpca = rbf_kernel_pca(X, gamma=15, n_components=2)
>>> fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(7,3))
>>> ax[0].scatter(X_kpca[y==0, 0], X_kpca[y==0, 1],
...               color='red', marker='^', alpha=0.5)
>>> ax[0].scatter(X_kpca[y==1, 0], X_kpca[y==1, 1],
...               color='blue', marker='o', alpha=0.5)
>>> ax[1].scatter(X_kpca[y==0, 0], np.zeros((500,1))+0.02,
...               color='red', marker='^', alpha=0.5)
>>> ax[1].scatter(X_kpca[y==1, 0], np.zeros((500,1))-0.02,
...               color='blue', marker='o', alpha=0.5)
>>> ax[0].set_xlabel('PC1')
>>> ax[0].set_ylabel('PC2')
>>> ax[1].set_ylim([-1, 1])
>>> ax[1].set_yticks([])
>>> ax[1].set_xlabel('PC1')
>>> plt.show()


Again, the RBF kernel PCA projected the data onto a new subspace where the two classes become
linearly separable.


Projecting new data points


In the two previous example applications of kernel PCA, the half-moon shapes and the concentric 
circles, we projected a single dataset onto a new feature. In real applications, however, we may have 
more than one dataset that we want to transform, for example, training and test data, and typically also 
new samples we will collect after the model building and evaluation. In this section, you will learn 
how to project data points that were not part of the training dataset. 


As we remember from the standard PCA approach at the beginning of this chapter, we project data by
calculating the dot product between a transformation matrix and the input samples; the columns of the
projection matrix are the top $k$ eigenvectors ($v$) that we obtained from the covariance matrix. Now,
the question is how can we transfer this concept to kernel PCA? If we think back to the idea behind
kernel PCA, we remember that we obtained an eigenvector ($a$) of the centered kernel matrix (not the
covariance matrix), which means that those are the samples that are already projected onto the
principal component axis $v$. Thus, if we want to project a new sample $x'$ onto this principal
component axis, we'd need to compute the following:

$$\phi(x')^T v$$

Fortunately, we can use the kernel trick so that we don't have to calculate the projection
$\phi(x')^T v$ explicitly. However, it is worth noting that kernel PCA, in contrast to standard PCA,
is a memory-based method, which means that we have to reuse the original training set each time to
project new samples. We have to calculate the pairwise RBF kernel (similarity) between each $i$-th
sample in the training dataset and the new sample $x'$:

$$\phi(x')^T v = \sum_i a^{(i)} \phi(x')^T \phi(x^{(i)}) = \sum_i a^{(i)} \kappa(x', x^{(i)})$$

Here, eigenvectors $a$ and eigenvalues $\lambda$ of the kernel matrix $K$ satisfy the following
condition in the equation:

$$K a = \lambda a$$


After calculating the similarity between the new samples and the samples in the training set, we have
to normalize the eigenvector $a$ by its eigenvalue. Thus, let's modify the rbf_kernel_pca function
that we implemented earlier so that it also returns the eigenvalues of the kernel matrix:


from scipy.spatial.distance import pdist, squareform
from scipy.linalg import eigh
import numpy as np


def rbf_kernel_pca(X, gamma, n_components):
    """
    RBF kernel PCA implementation.

    Parameters
    ------------
    X: {NumPy ndarray}, shape = [n_samples, n_features]

    gamma: float
      Tuning parameter of the RBF kernel

    n_components: int
      Number of principal components to return

    Returns
    ------------
    X_pc: {NumPy ndarray}, shape = [n_samples, k_features]
      Projected dataset

    lambdas: list
      Eigenvalues

    """
    # Calculate pairwise squared Euclidean distances
    # in the MxN dimensional dataset.
    sq_dists = pdist(X, 'sqeuclidean')

    # Convert pairwise distances into a square matrix.
    mat_sq_dists = squareform(sq_dists)

    # Compute the symmetric kernel matrix.
    # np.exp is used in place of the deprecated scipy.exp alias.
    K = np.exp(-gamma * mat_sq_dists)

    # Center the kernel matrix.
    N = K.shape[0]
    one_n = np.ones((N, N)) / N
    K = K - one_n.dot(K) - K.dot(one_n) + one_n.dot(K).dot(one_n)

    # Obtaining eigenpairs from the centered kernel matrix;
    # scipy.linalg.eigh returns them in ascending eigenvalue order.
    eigvals, eigvecs = eigh(K)

    # Collect the top k eigenvectors (projected samples)
    alphas = np.column_stack([eigvecs[:, -i]
                              for i in range(1, n_components + 1)])

    # Collect the corresponding eigenvalues
    lambdas = [eigvals[-i] for i in range(1, n_components + 1)]

    return alphas, lambdas


Now, let's create a new half-moon dataset and project it onto a one-dimensional subspace using the
updated RBF kernel PCA implementation:


>>> X, y = make_moons(n_samples=100, random_state=123)
>>> alphas, lambdas = rbf_kernel_pca(X, gamma=15, n_components=1)


To make sure that we implement the code for projecting new samples, let's assume that the 26th point
from the half-moon dataset is a new data point $x'$, and our task is to project it onto this new
subspace:


>>> x_new = X[25]
>>> x_new
array([ 1.8713187 ,  0.00928245])
>>> x_proj = alphas[25] # original projection
>>> x_proj
array([ 0.07877284])
>>> def project_x(x_new, X, gamma, alphas, lambdas):
...     pair_dist = np.array([np.sum(
...                 (x_new-row)**2) for row in X])
...     k = np.exp(-gamma * pair_dist)
...     return k.dot(alphas / lambdas)


By executing the following code, we are able to reproduce the original projection. Using the
project_x function, we will be able to project any new data samples as well. The code is as
follows:


>>> x_reproj = project_x(x_new, X,
...                      gamma=15, alphas=alphas, lambdas=lambdas)
>>> x_reproj
array([ 0.07877284])
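The reprojection idea can also be exercised without scikit-learn or SciPy. The sketch below is a NumPy-only variant written for illustration; the helper names rbf_kernel_matrix, fit_kpca, and project_new, the random data, and gamma=1.0 are all assumptions. It differs from the project_x function in one respect that should be stated plainly: the kernel vector of the new sample is centered with the training statistics before it is multiplied by the normalized eigenvectors, which makes the reprojection of a training sample match its training projection exactly rather than up to a constant offset.

```python
import numpy as np

def rbf_kernel_matrix(A, B, gamma):
    """Pairwise RBF kernel values between the rows of A and the rows of B."""
    sq_dists = (np.sum(A ** 2, axis=1)[:, None]
                + np.sum(B ** 2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def fit_kpca(X, gamma, n_components):
    """Return the uncentered kernel matrix, top eigenvectors, and eigenvalues."""
    K = rbf_kernel_matrix(X, X, gamma)
    n = K.shape[0]
    one_n = np.ones((n, n)) / n
    K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    eigvals, eigvecs = np.linalg.eigh(K_c)          # ascending order
    alphas = eigvecs[:, ::-1][:, :n_components]     # top eigenvectors first
    lambdas = eigvals[::-1][:n_components]
    return K, alphas, lambdas

def project_new(x_new, X, K, gamma, alphas, lambdas):
    """Project one new sample, centering its kernel vector like the training data."""
    k = rbf_kernel_matrix(x_new[None, :], X, gamma)[0]
    k_c = k - K.mean(axis=0) - k.mean() + K.mean()
    return k_c @ (alphas / lambdas)

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 2))                        # hypothetical training set
K, alphas, lambdas = fit_kpca(X, gamma=1.0, n_components=2)

x_reproj = project_new(X[25], X, K, 1.0, alphas, lambdas)
print(np.allclose(x_reproj, alphas[25]))
```

Because the whole training set X and its kernel matrix are needed at projection time, the sketch also makes the memory-based nature of kernel PCA visible.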


Lastly, let's visualize the projection on the first principal component:


>>> plt.scatter(alphas[y==0, 0], np.zeros((50)),
...             color='red', marker='^', alpha=0.5)
>>> plt.scatter(alphas[y==1, 0], np.zeros((50)),
...             color='blue', marker='o', alpha=0.5)
>>> plt.scatter(x_proj, 0, color='black',
...             label='original projection of point X[25]',
...             marker='^', s=100)
>>> plt.scatter(x_reproj, 0, color='green',
...             label='remapped point X[25]',
...             marker='x', s=500)
>>> plt.legend(scatterpoints=1)
>>> plt.show()


As we can see in the following scatterplot, we mapped the sample $x'$ onto the first principal
component correctly:

[Figure: projection onto the first principal component, with markers for the original projection of
point X[25] and the remapped point X[25]]


Kernel principal component analysis in scikit-learn


For our convenience, scikit-learn implements a kernel PCA class in the sklearn.decomposition
submodule. The usage is similar to the standard PCA class, and we can specify the kernel via the
kernel parameter:


>>> from sklearn.decomposition import KernelPCA
>>> X, y = make_moons(n_samples=100, random_state=123)
>>> scikit_kpca = KernelPCA(n_components=2,
...                         kernel='rbf', gamma=15)
>>> X_skernpca = scikit_kpca.fit_transform(X)
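A numerical cross-check between the three manual steps and KernelPCA is sketched below. It relies on two assumptions that are worth verifying against the scikit-learn version in use: that KernelPCA returns each unit-norm eigenvector scaled by the square root of its eigenvalue, and that the leading eigenvalue is not degenerate for this dataset. The sign of an eigenvector is arbitrary, so absolute values are compared.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA

X, y = make_moons(n_samples=100, random_state=123)
gamma = 15

# Manual RBF kernel PCA: kernel matrix, centering, eigendecomposition
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
K = np.exp(-gamma * sq_dists)
n = K.shape[0]
one_n = np.ones((n, n)) / n
K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n
eigvals, eigvecs = np.linalg.eigh(K_c)              # ascending order
top_vec, top_val = eigvecs[:, -1], eigvals[-1]

# scikit-learn's implementation, first component only
X_sk = KernelPCA(n_components=1, kernel='rbf', gamma=gamma).fit_transform(X)

# Compare up to sign and the assumed sqrt(eigenvalue) scaling
manual = np.sqrt(top_val) * top_vec
consistent = np.allclose(np.abs(manual), np.abs(X_sk[:, 0]), atol=1e-6)
print(consistent)
```

If the scaling assumption holds, the two projections differ only by a per-component constant, which does not change how well a linear classifier can separate the classes.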


To see if we get results that are consistent with our own kernel PCA implementation, let's plot the
transformed half-moon shape data onto the first two principal components:


>>> plt.scatter(X_skernpca[y==0, 0], X_skernpca[y==0, 1],
...             color='red', marker='^', alpha=0.5)
>>> plt.scatter(X_skernpca[y==1, 0], X_skernpca[y==1, 1],
...             color='blue', marker='o', alpha=0.5)
>>> plt.xlabel('PC1')
>>> plt.ylabel('PC2')
>>> plt.show()


As we can see, the results of the scikit-learn KernelPCA are consistent with our own implementation.


Note

Scikit-learn also implements advanced techniques for nonlinear dimensionality reduction that are
beyond the scope of this book. You can find a nice overview of the current implementations in
scikit-learn complemented with illustrative examples at
http://scikit-learn.org/stable/modules/manifold.html.


Summary


In this chapter, you learned about three different, fundamental dimensionality reduction techniques for 
feature extraction: standard PCA, LDA, and kernel PCA. Using PCA, we projected data onto a 
lower-dimensional subspace to maximize the variance along the orthogonal feature axes while 
ignoring the class labels. LDA, in contrast to PCA, is a technique for supervised dimensionality 
reduction, which means that it considers class information in the training dataset to attempt to 
maximize the class-separability in a linear feature space. Lastly, you learned about a kernelized 
version of PCA, which allows you to map nonlinear datasets onto a lower-dimensional feature space 
where the classes become linearly separable. 


Equipped with these essential preprocessing techniques, you are now well prepared to learn about
the best practices for efficiently incorporating different preprocessing techniques and evaluating the
performance of different models in the next chapter.


Chapter 6. Learning Best Practices for Model Evaluation and Hyperparameter Tuning


In the previous chapters, you learned about the essential machine learning algorithms for 
classification and how to get our data into shape before we feed it into those algorithms. Now, it's 
time to learn about the best practices of building good machine learning models by fine-tuning the 
algorithms and evaluating the model's performance! In this chapter, we will learn how to: 


Obtain unbiased estimates of a model's performance 

Diagnose the common problems of machine learning algorithms 
Fine-tune machine learning models 

Evaluate predictive models using different performance metrics 




Streamlining workflows with pipelines 


When we applied different preprocessing techniques in the previous chapters, such as 
standardization for feature scaling in Chapter 4, Building Good Training Sets — Data 
Preprocessing, or principal component analysis for data compression in Chapter 5, Compressing 
Data via Dimensionality Reduction, you learned that we have to reuse the parameters that were 
obtained during the fitting of the training data to scale and compress any new data, for example, the 
samples in the separate test dataset. In this section, you will learn about an extremely handy tool, the 
Pipeline class in scikit-learn. It allows us to fit a model including an arbitrary number of 
transformation steps and apply it to make predictions about new data. 
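
As a minimal sketch of the manual workflow that a pipeline automates, the following example fits a StandardScaler on training data only and then reuses the learned mean and scale on new samples. The toy arrays X_train_toy and X_new_toy are hypothetical and are not part of the chapter's dataset:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: six training rows and two "new" rows, two features each.
X_train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0],
                        [4.0, 40.0], [5.0, 50.0], [6.0, 60.0]])
X_new_toy = np.array([[3.5, 35.0], [7.0, 70.0]])

scaler = StandardScaler()
scaler.fit(X_train_toy)                     # learns mean_ and scale_ from the training data only
X_train_std = scaler.transform(X_train_toy)
X_new_std = scaler.transform(X_new_toy)     # reuses the training mean and scale

print(scaler.mean_)
print(X_new_std)
```

Calling fit again on the new data would estimate a different mean and scale, so the new samples would no longer be comparable to the training samples.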
Loading the Breast Cancer Wisconsin dataset


In this chapter, we will be working with the Breast Cancer Wisconsin dataset, which contains 569 
samples of malignant and benign tumor cells. The first two columns in the dataset store the unique ID 
numbers of the samples and the corresponding diagnosis (M=malignant, B=benign), respectively. 
The columns 3-32 contain 30 real-value features that have been computed from digitized images of 
the cell nuclei, which can be used to build a model to predict whether a tumor is benign or malignant. 
The Breast Cancer Wisconsin dataset has been deposited on the UCI machine learning repository
and more detailed information about this dataset can be found at
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic).





In this section, we will read in the dataset and split it into training and test datasets in three simple
steps:


1. We will start by reading in the dataset directly from the UCI website using pandas: 


>>> import pandas as pd
>>> df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-'
...                  'databases/breast-cancer-wisconsin/wdbc.data', header=None)


2. Next, we assign the 30 features to a NumPy array X. Using LabelEncoder, we transform the
class labels from their original string representation (M and B) into integers:


>>> from sklearn.preprocessing import LabelEncoder
>>> X = df.loc[:, 2:].values
>>> y = df.loc[:, 1].values
>>> le = LabelEncoder()
>>> y = le.fit_transform(y)


After encoding the class labels (diagnosis) in an array y, the malignant tumors are now 
represented as class 1, and the benign tumors are represented as class 0, respectively, which we 
can illustrate by calling the transform method of LabelEncoder on two dummy class labels: 


>>> le.transform(['M', 'B'])
array([1, 0])


3. Before we construct our first model pipeline in the following subsection, let's divide the dataset 
into a separate training dataset (80 percent of the data) and a separate test dataset (20 percent of 
the data): 


>>> from sklearn.cross_validation import train_test_split
>>> X_train, X_test, y_train, y_test = \
...     train_test_split(X, y, test_size=0.20, random_state=1)
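
The encoding and splitting steps above can be sketched without downloading anything. This is a hedged example: the random feature matrix X_demo and the label array y_raw are hypothetical stand-ins with the same shape as the dataset, and the import uses sklearn.model_selection, which is where train_test_split lives in scikit-learn 0.18 and later (the sklearn.cross_validation module was removed in version 0.20):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

rng = np.random.default_rng(0)
# Hypothetical stand-in for the 569 x 30 feature matrix and the M/B labels.
X_demo = rng.normal(size=(569, 30))
y_raw = np.array(['M'] * 212 + ['B'] * 357)

le_demo = LabelEncoder()
y_demo = le_demo.fit_transform(y_raw)   # classes are sorted, so 'B' -> 0 and 'M' -> 1

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1)

print(le_demo.classes_)
print(X_tr.shape, X_te.shape)
```

LabelEncoder assigns integers in sorted label order, which is why the malignant class ends up as 1 here.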
Combining transformers and estimators in a pipeline


In the previous chapter, you learned that many learning algorithms require input features on the same 
scale for optimal performance. Thus, we need to standardize the columns in the Breast Cancer 
Wisconsin dataset before we can feed them to a linear classifier, such as logistic regression. 
Furthermore, let's assume that we want to compress our data from the initial 30 dimensions onto a 
lower two-dimensional subspace via principal component analysis (PCA), a feature extraction 
technique for dimensionality reduction that we introduced in Chapter 5, Compressing Data via 
Dimensionality Reduction. Instead of going through the fitting and transformation steps for the 
training and test dataset separately, we can chain the StandardScaler, PCA, and
LogisticRegression objects in a pipeline:


>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.decomposition import PCA
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.pipeline import Pipeline
>>> pipe_lr = Pipeline([('scl', StandardScaler()),
...                     ('pca', PCA(n_components=2)),
...                     ('clf', LogisticRegression(random_state=1))])
>>> pipe_lr.fit(X_train, y_train)
>>> print('Test Accuracy: %.3f' % pipe_lr.score(X_test, y_test))
Test Accuracy: 0.947


The Pipeline object takes a list of tuples as input, where the first value in each tuple is an arbitrary
identifier string that we can use to access the individual elements in the pipeline, as we will see later
in this chapter, and the second element in every tuple is a scikit-learn transformer or estimator.


The intermediate steps in a pipeline constitute scikit-learn transformers, and the last step is an
estimator. In the preceding code example, we built a pipeline that consisted of two intermediate steps,
a StandardScaler and a PCA transformer, and a logistic regression classifier as a final estimator.
When we executed the fit method on the pipeline pipe_lr, the StandardScaler performed fit and
transform on the training data, and the transformed training data was then passed on to the next
object in the pipeline, the PCA. Similar to the previous step, PCA also executed fit and transform on
the scaled input data and passed it to the final element of the pipeline, the estimator. We should note
that there is no limit to the number of intermediate steps in this pipeline. The concept of how
pipelines work is summarized in the following figure:

[Figure: how a pipeline chains the fit and transform calls of its transformers before the final estimator]
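
To make the chaining concrete, the following sketch compares a pipeline with the same steps written out by hand. It assumes synthetic data from make_classification instead of the chapter's dataset, and it pins svd_solver='full' so that both PCA fits use the same solver; the variable names are illustrative only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the chapter's dataset).
X_syn, y_syn = make_classification(n_samples=300, n_features=30,
                                   n_informative=5, random_state=1)
X_fit, X_new = X_syn[:240], X_syn[240:]
y_fit = y_syn[:240]

pipe_demo = Pipeline([('scl', StandardScaler()),
                      ('pca', PCA(n_components=2, svd_solver='full')),
                      ('clf', LogisticRegression(random_state=1))])
pipe_demo.fit(X_fit, y_fit)

# The same chain written out by hand: each step is fit on the training data only.
scl = StandardScaler().fit(X_fit)
pca = PCA(n_components=2, svd_solver='full').fit(scl.transform(X_fit))
clf = LogisticRegression(random_state=1).fit(
    pca.transform(scl.transform(X_fit)), y_fit)
manual_pred = clf.predict(pca.transform(scl.transform(X_new)))

# Fraction of new samples on which both approaches predict the same class.
agreement = np.mean(pipe_demo.predict(X_new) == manual_pred)
print(agreement)

# Individual steps remain accessible through their identifier strings.
print(pipe_demo.named_steps['pca'].explained_variance_ratio_)
```

The identifier strings ('scl', 'pca', 'clf') are what make the named_steps lookup possible.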
Using k-fold cross-validation to assess model
performance


One of the key steps in building a machine learning model is to estimate its performance on data that
the model hasn't seen before. Let's assume that we fit our model on a training dataset and use the same
data to estimate how well it performs in practice. We remember from the Tackling overfitting via
regularization section in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, that
a model can either suffer from underfitting (high bias) if the model is too simple, or it can overfit the
training data (high variance) if the model is too complex for the underlying training data. To find an
acceptable bias-variance trade-off, we need to evaluate our model carefully. In this section, you will
learn about two useful cross-validation techniques, holdout cross-validation and k-fold
cross-validation, which can help us to obtain reliable estimates of the model's generalization error, that is,
how well the model performs on unseen data.
The holdout method


A classic and popular approach for estimating the generalization performance of machine learning
models is holdout cross-validation. Using the holdout method, we split our initial dataset into a
separate training and test dataset: the former is used for model training, and the latter is used to
estimate its performance. However, in typical machine learning applications, we are also interested
in tuning and comparing different parameter settings to further improve the performance for making
predictions on unseen data. This process is called model selection, where the term model selection
refers to a given classification problem for which we want to select the optimal values of tuning
parameters (also called hyperparameters). However, if we reuse the same test dataset over and over
again during model selection, it will become part of our training data and thus the model will be more
likely to overfit. Despite this issue, many people still use the test set for model selection, which is not
a good machine learning practice.


A better way of using the holdout method for model selection is to separate the data into three parts: a
training set, a validation set, and a test set. The training set is used to fit the different models, and the
performance on the validation set is then used for the model selection. The advantage of having a test
set that the model hasn't seen before during the training and model selection steps is that we can
obtain a less biased estimate of its ability to generalize to new data. The following figure illustrates
the concept of holdout cross-validation where we use a validation set to repeatedly evaluate the
performance of the model after training using different parameter values. Once we are satisfied with
the tuning of parameter values, we estimate the model's generalization error on the test dataset:


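[Figure: the holdout method with a validation set. The original set is split into a training set, a
validation set, and a test set. The training and validation sets are used for training, tuning, and
evaluation of the machine learning algorithm; the test set gives the final performance estimate of
the predictive model.]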





A disadvantage of the holdout method is that the performance estimate is sensitive to how we
partition the training set into the training and validation subsets; the estimate will vary for different
samples of the data. In the next subsection, we will take a look at a more robust technique for
performance estimation, k-fold cross-validation, where we repeat the holdout method k times on k
subsets of the training data.
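
The three-way holdout split described in this subsection can be sketched by calling train_test_split twice. This is only an illustration under stated assumptions: the random arrays X_all and y_all are hypothetical, the 60/20/20 proportions are one possible choice, and the import path is the sklearn.model_selection module of scikit-learn 0.18 and later:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_all = rng.normal(size=(500, 4))        # hypothetical feature matrix
y_all = rng.integers(0, 2, size=500)     # hypothetical binary labels

# First hold out the test set, then carve a validation set out of the remainder.
X_rest, X_test_h, y_rest, y_test_h = train_test_split(
    X_all, y_all, test_size=0.20, random_state=1, stratify=y_all)
X_train_h, X_val_h, y_train_h, y_val_h = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1, stratify=y_rest)

print(len(X_train_h), len(X_val_h), len(X_test_h))
```

Changing random_state in the second call yields a different training/validation partition, which is exactly the source of the sensitivity mentioned above.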
K-fold cross-validation


In k-fold cross-validation, we randomly split the training dataset into k folds without replacement,
where k - 1 folds are used for the model training and one fold is used for testing. This procedure is
repeated k times so that we obtain k models and performance estimates.


Note 


In case you are not familiar with the terms sampling with and without replacement, let's walk through 
a simple thought experiment. Let's assume we are playing a lottery game where we randomly draw 
numbers from an urn. We start with an urn that holds five unique numbers 0, 1, 2, 3, and 4, and we 
draw exactly one number each turn. In the first round, the chance of drawing a particular number from 
the urn would be 1/5. Now, in sampling without replacement, we do not put the number back into the 
urn after each turn. Consequently, the probability of drawing a particular number from the set of 
remaining numbers in the next round depends on the previous round. For example, if we have a
remaining set of numbers 0, 1, 2, and 4, the chance of drawing number 0 would become 1/4 in the next 
turn. 


However, in random sampling with replacement, we always return the drawn number to the urn so
that the probability of drawing a particular number at each turn does not change; we can draw the
same number more than once. In other words, in sampling with replacement, the samples (numbers)
are independent and have a covariance of zero. For example, the results from five rounds of drawing
random numbers could look like this:


• Random sampling without replacement: 2, 1, 3, 4, 0
• Random sampling with replacement: 1, 3, 3, 4, 1
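
The urn experiment can be reproduced with NumPy. This is a small sketch that assumes NumPy's Generator API; the seed is arbitrary, so the drawn numbers will generally differ from the ones listed above:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
urn = np.arange(5)      # the numbers 0, 1, 2, 3, 4

# Without replacement: every number is drawn exactly once.
without = rng.choice(urn, size=5, replace=False)
# With replacement: each draw is independent, so repeats are possible.
with_repl = rng.choice(urn, size=5, replace=True)

print('without replacement:', without)
print('with replacement:   ', with_repl)
```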


We then calculate the average performance of the models based on the different, independent folds to 
obtain a performance estimate that is less sensitive to the subpartitioning of the training data 
compared to the holdout method. Typically, we use k-fold cross-validation for model tuning, that is,
finding the optimal hyperparameter values that yield a satisfying generalization performance. Once 
we have found satisfactory hyperparameter values, we can retrain the model on the complete training 
set and obtain a final performance estimate using the independent test set. 


Since k-fold cross-validation is a resampling technique without replacement, the advantage of this
approach is that each sample point will be used for testing exactly once (and for training in the
other k - 1 iterations), which yields a lower-variance estimate of the model performance than the
holdout method. The following figure summarizes the concept behind k-fold cross-validation with
k = 10. The training dataset is divided into 10 folds, and during the 10 iterations, 9 folds are used for
training, and 1 fold will be used as the test set for the model evaluation. Also, the estimated
performances E_i (for example, classification accuracy or error) for each fold are then used to
calculate the estimated average performance E of the model:
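[Figure: k-fold cross-validation with k = 10. In each of the 10 iterations, the training set is split into
nine training folds and one test fold, producing a performance estimate E_i; the final estimate is the
average E = (1/10) * (E_1 + E_2 + ... + E_10).]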





The standard value for k in k-fold cross-validation is 10, which is typically a reasonable choice for
most applications. However, if we are working with relatively small training sets, it can be useful to
increase the number of folds. If we increase the value of k, more training data will be used in each
iteration, which results in a lower bias towards estimating the generalization performance by
averaging the individual model estimates. However, large values of k will also increase the runtime
of the cross-validation algorithm and yield estimates with higher variance since the training folds will
be more similar to each other. On the other hand, if we are working with large datasets, we can
choose a smaller value for k, for example, k = 5, and still obtain an accurate estimate of the average
performance of the model while reducing the computational cost of refitting and evaluating the model
on the different folds.
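
The splitting logic itself is simple enough to write by hand. The following from-scratch sketch (the helper kfold_indices is hypothetical and is not a scikit-learn function) shuffles the sample indices once, cuts them into k disjoint folds, and counts how often each sample is used for testing and for training:

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k folds drawn without replacement."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)      # k disjoint folds of nearly equal size
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

n, k = 455, 10
test_counts = np.zeros(n, dtype=int)
train_counts = np.zeros(n, dtype=int)
for train_idx, test_idx in kfold_indices(n, k):
    test_counts[test_idx] += 1
    train_counts[train_idx] += 1

# Each sample is tested exactly once and used for training k - 1 times.
print(test_counts.min(), test_counts.max())
print(train_counts.min(), train_counts.max())
```

In a real evaluation, the loop body would fit the model on train_idx and score it on test_idx, then average the k scores.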


Note 


A special case of k-fold cross-validation is the leave-one-out (LOO) cross-validation method. In
LOO, we set the number of folds equal to the number of training samples (k = n) so that only one
training sample is used for testing during each iteration. This is a recommended approach for working
with very small datasets.
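
As a brief illustration of the k = n case, the sketch below uses the LeaveOneOut splitter from sklearn.model_selection (scikit-learn 0.18 and later) on a hypothetical six-sample array:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X_small = np.arange(12).reshape(6, 2)    # hypothetical dataset with n = 6 samples
loo = LeaveOneOut()

splits = list(loo.split(X_small))
print(loo.get_n_splits(X_small))         # number of folds equals the number of samples
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```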


A slight improvement over the standard k-fold cross-validation approach is stratified k-fold
cross-validation, which can yield better bias and variance estimates, especially in cases of unequal class
proportions, as it has been shown in a study by R. Kohavi et al. (R. Kohavi et al. A Study of
Cross-validation and Bootstrap for Accuracy Estimation and Model Selection. In IJCAI, volume 14, pages
1137-1145, 1995). In stratified cross-validation, the class proportions are preserved in each fold to
ensure that each fold is representative of the class proportions in the training dataset, which we will
illustrate by using the StratifiedKFold iterator in scikit-learn:


>>> import numpy as np
>>> from sklearn.cross_validation import StratifiedKFold
>>> kfold = StratifiedKFold(y=y_train,
...                         n_folds=10,
...                         random_state=1)
>>> scores = []
>>> for k, (train, test) in enumerate(kfold):
...     pipe_lr.fit(X_train[train], y_train[train])
...     score = pipe_lr.score(X_train[test], y_train[test])
...     scores.append(score)
...     print('Fold: %s, Class dist.: %s, Acc: %.3f' % (k+1,
...           np.bincount(y_train[train]), score))
Fold: 1, Class dist.: [256 153], Acc: 0.891
Fold: 2, Class dist.: [256 153], Acc: 0.978
Fold: 3, Class dist.: [256 153], Acc: 0.978
Fold: 4, Class dist.: [256 153], Acc: 0.913
Fold: 5, Class dist.: [256 153], Acc: 0.935
Fold: 6, Class dist.: [257 153], Acc: 0.978
Fold: 7, Class dist.: [257 153], Acc: 0.933
Fold: 8, Class dist.: [257 153], Acc: 0.956
Fold: 9, Class dist.: [257 153], Acc: 0.978
Fold: 10, Class dist.: [257 153], Acc: 0.956
>>> print('CV accuracy: %.3f +/- %.3f' % (
...       np.mean(scores), np.std(scores)))
CV accuracy: 0.950 +/- 0.029


First, we initialized the StratifiedKFold iterator from the sklearn.cross_validation module
with the class labels y_train in the training set, and specified the number of folds via the n_folds
parameter. When we used the kfold iterator to loop through the k folds, we used the returned indices
in train to fit the logistic regression pipeline that we set up at the beginning of this chapter. Using the
pipe_lr pipeline, we ensured that the samples were scaled properly (for instance, standardized) in
each iteration. We then used the test indices to calculate the accuracy score of the model, which we
collected in the scores list to calculate the average accuracy and the standard deviation of the
estimate.
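
In scikit-learn 0.18 and later, StratifiedKFold lives in sklearn.model_selection and has a slightly different interface: the number of folds is passed as n_splits, and the labels are passed to the split method rather than to the constructor. The following sketch shows that interface on hypothetical labels with unequal class proportions; the array names are assumptions made for illustration:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels with unequal class proportions.
y_lab = np.array([0] * 285 + [1] * 170)
X_dummy = np.zeros((len(y_lab), 1))      # features do not influence the split itself

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
test_dists = []
for k_idx, (train_idx, test_idx) in enumerate(skf.split(X_dummy, y_lab)):
    dist = np.bincount(y_lab[test_idx])  # class counts in this test fold
    test_dists.append(dist)
    print('Fold: %2d, test class dist.: %s' % (k_idx + 1, dist))
test_dists = np.array(test_dists)
```

Because the class proportions are preserved, the per-class counts differ by at most one sample between folds.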


Although the previous code example was useful to illustrate how k-fold cross-validation works, 
scikit-learn also implements a k-fold cross-validation scorer, which allows us to evaluate our model 
using stratified k-fold cross-validation more efficiently: 


>>> from sklearn.cross_validation import cross_val_score
>>> scores = cross_val_score(estimator=pipe_lr,
...                          X=X_train,
...                          y=y_train,
...                          cv=10,
...                          n_jobs=1)
>>> print('CV accuracy scores: %s' % scores)
CV accuracy scores: [ 0.89130435  0.97826087  0.97826087
                      0.91304348  0.93478261  0.97777778
                      0.93333333  0.95555556  0.97777778
                      0.95555556]
>>> print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))
CV accuracy: 0.950 +/- 0.029

An extremely useful feature of the cross_val_score approach is that we can distribute the
evaluation of the different folds across multiple CPUs on our machine. If we set the n_jobs
parameter to 1, only one CPU will be used to evaluate the performances just like in our
StratifiedKFold example previously. However, by setting n_jobs=2 we could distribute the 10
rounds of cross-validation to two CPUs (if available on our machine), and by setting n_jobs=-1, we
can use all available CPUs on our machine to do the computation in parallel.
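
For readers on scikit-learn 0.18 or later, the same call is available from sklearn.model_selection. The sketch below is self-contained and therefore uses synthetic data from make_classification; the pipeline pipe_cv and the data are assumptions for illustration, not the chapter's objects:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the chapter's dataset).
X_cv, y_cv = make_classification(n_samples=300, n_features=10, random_state=1)
pipe_cv = Pipeline([('scl', StandardScaler()),
                    ('clf', LogisticRegression(random_state=1))])

# With a classifier and an integer cv, stratified k-fold splitting is used.
cv_scores = cross_val_score(estimator=pipe_cv, X=X_cv, y=y_cv, cv=10, n_jobs=1)
print('CV accuracy: %.3f +/- %.3f' % (np.mean(cv_scores), np.std(cv_scores)))
```

Setting n_jobs=-1 instead of 1 would distribute the ten fits across all available CPUs.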


Note 


Please note that a detailed discussion of how the variance of the generalization performance is
estimated in cross-validation is beyond the scope of this book, but you can find a detailed discussion
in this excellent article by M. Markatou et al. (M. Markatou, H. Tian, S. Biswas, and G. M. Hripcsak.
Analysis of Variance of Cross-validation Estimators of the Generalization Error. Journal of
Machine Learning Research, 6:1127-1168, 2005).


You can also read about alternative cross-validation techniques, such as the .632 Bootstrap
cross-validation method (B. Efron and R. Tibshirani. Improvements on Cross-validation: The .632+
Bootstrap Method. Journal of the American Statistical Association, 92(438):548-560, 1997).
Debugging algorithms with learning and
validation curves


In this section, we will take a look at two very simple yet powerful diagnostic tools that can help us 
to improve the performance of a learning algorithm: learning curves and validation curves. In the 
next subsections, we will discuss how we can use learning curves to diagnose if a learning algorithm 
has a problem with overfitting (high variance) or underfitting (high bias). Furthermore, we will take a 
look at validation curves that can help us address the common issues of a learning algorithm. 
Diagnosing bias and variance problems with learning
curves


If a model is too complex for a given training dataset (there are too many degrees of freedom or
parameters in this model), the model tends to overfit the training data and does not generalize well to
unseen data. Often, it can help to collect more training samples to reduce the degree of overfitting.
However, in practice, it can often be very expensive or simply not feasible to collect more data. By
plotting the model training and validation accuracies as functions of the training set size, we can
easily detect whether the model suffers from high variance or high bias, and whether the collection of
more data could help to address this problem. But before we discuss how to plot learning curves in
scikit-learn, let's discuss those two common model issues by walking through the following
illustration:


[Figure: three learning-curve panels (upper-left: High bias; upper-right: High variance; bottom: Good
bias-variance trade-off). Each panel plots the training accuracy and the validation accuracy against
the number of training samples, with the desired accuracy marked.]





The graph in the upper-left shows a model with high bias. This model has both low training and
cross-validation accuracy, which indicates that it underfits the training data. Common ways to address
this issue are to increase the number of parameters of the model, for example, by collecting or
constructing additional features, or by decreasing the degree of regularization, for example, in SVM
or logistic regression classifiers. The graph in the upper-right shows a model that suffers from high
variance, which is indicated by the large gap between the training and cross-validation accuracy. To
address this problem of overfitting, we can collect more training data or reduce the complexity of the
model, for example, by increasing the regularization parameter; for unregularized models, it can also
help to decrease the number of features via feature selection (Chapter 4, Building Good Training 
Sets — Data Preprocessing) or feature extraction (Chapter 5, Compressing Data via Dimensionality 
Reduction). We shall note that collecting more training data decreases the chance of overfitting. 
However, it may not always help, for example, when the training data is extremely noisy or the model 
is already very close to optimal. 


In the next subsection, we will see how to address those model issues using validation curves, but
let's first see how we can use the learning_curve function from scikit-learn to evaluate the model:


>>> import matplotlib.pyplot as plt
>>> from sklearn.learning_curve import learning_curve
>>> pipe_lr = Pipeline([
...     ('scl', StandardScaler()),
...     ('clf', LogisticRegression(
...         penalty='l2', random_state=0))])
>>> train_sizes, train_scores, test_scores =\
...     learning_curve(estimator=pipe_lr,
...                    X=X_train,
...                    y=y_train,
...                    train_sizes=np.linspace(0.1, 1.0, 10),
...                    cv=10,
...                    n_jobs=1)
>>> train_mean = np.mean(train_scores, axis=1)
>>> train_std = np.std(train_scores, axis=1)
>>> test_mean = np.mean(test_scores, axis=1)
>>> test_std = np.std(test_scores, axis=1)
>>> plt.plot(train_sizes, train_mean,
...          color='blue', marker='o',
...          markersize=5,
...          label='training accuracy')
>>> plt.fill_between(train_sizes,
...                  train_mean + train_std,
...                  train_mean - train_std,
...                  alpha=0.15, color='blue')
>>> plt.plot(train_sizes, test_mean,
...          color='green', linestyle='--',
...          marker='s', markersize=5,
...          label='validation accuracy')
>>> plt.fill_between(train_sizes,
...                  test_mean + test_std,
...                  test_mean - test_std,
...                  alpha=0.15, color='green')
>>> plt.grid()
>>> plt.xlabel('Number of training samples')
>>> plt.ylabel('Accuracy')
>>> plt.legend(loc='lower right')
>>> plt.ylim([0.8, 1.0])
>>> plt.show()
After we have successfully executed the preceding code, we will obtain the following learning curve
plot:

[Figure: learning curve showing accuracy (y axis) against the number of training samples (x axis)]


Via the train_sizes parameter in the learning_curve function, we can control the absolute or
relative number of training samples that are used to generate the learning curves. Here, we set
train_sizes=np.linspace(0.1, 1.0, 10) to use 10 evenly spaced relative intervals for the
training set sizes. By default, the learning_curve function uses stratified k-fold cross-validation to
calculate the cross-validation accuracy, and we set k = 10 via the cv parameter. Then, we simply
calculate the average accuracies from the returned cross-validated training and test scores for the
different sizes of the training set, which we plotted using matplotlib's plot function. Furthermore, we
add the standard deviation of the average accuracies to the plot using the fill_between function to
indicate the variance of the estimate.


As we can see in the preceding learning curve plot, our model performs quite well on the test dataset.
However, it may be slightly overfitting the training data, as indicated by a relatively small, but visible,
gap between the training and cross-validation accuracy curves.
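
The gap between the two curves can also be inspected numerically. The sketch below assumes scikit-learn 0.18 or later, where learning_curve is imported from sklearn.model_selection, and it uses synthetic data and a smaller grid of training set sizes than the chapter's example to keep the run short:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the chapter's dataset).
X_lc, y_lc = make_classification(n_samples=400, n_features=20, random_state=0)
pipe_lc = Pipeline([('scl', StandardScaler()),
                    ('clf', LogisticRegression(random_state=0))])

sizes, tr_scores, val_scores = learning_curve(
    estimator=pipe_lc, X=X_lc, y=y_lc,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5, n_jobs=1)

# One row per training set size, one column per cross-validation fold.
tr_mean = tr_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
for n_tr, tr, va in zip(sizes, tr_mean, val_mean):
    print('n=%3d  train=%.3f  validation=%.3f  gap=%.3f' % (n_tr, tr, va, tr - va))
```

A gap that shrinks as the number of training samples grows suggests that more data helps; a gap that stays large points to high variance.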
Addressing overfitting and underfitting with validation
curves


Validation curves are a useful tool for improving the performance of a model by addressing issues
such as overfitting or underfitting. Validation curves are related to learning curves, but instead of
plotting the training and test accuracies as functions of the sample size, we vary the values of the
model parameters, for example, the inverse regularization parameter C in logistic regression. Let's go
ahead and see how we create validation curves via scikit-learn:


>>> from sklearn.learning_curve import validation_curve
>>> param_range = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
>>> train_scores, test_scores = validation_curve(
...                 estimator=pipe_lr,
...                 X=X_train,
...                 y=y_train,
...                 param_name='clf__C',
...                 param_range=param_range,
...                 cv=10)
>>> train_mean = np.mean(train_scores, axis=1)
>>> train_std = np.std(train_scores, axis=1)
>>> test_mean = np.mean(test_scores, axis=1)
>>> test_std = np.std(test_scores, axis=1)
>>> plt.plot(param_range, train_mean,
...          color='blue', marker='o',
...          markersize=5,
...          label='training accuracy')
>>> plt.fill_between(param_range, train_mean + train_std,
...                  train_mean - train_std, alpha=0.15,
...                  color='blue')
>>> plt.plot(param_range, test_mean,
...          color='green', linestyle='--',
...          marker='s', markersize=5,
...          label='validation accuracy')
>>> plt.fill_between(param_range,
...                  test_mean + test_std,
...                  test_mean - test_std,
...                  alpha=0.15, color='green')
>>> plt.grid()
>>> plt.xscale('log')
>>> plt.legend(loc='lower right')
>>> plt.xlabel('Parameter C')
>>> plt.ylabel('Accuracy')
>>> plt.ylim([0.8, 1.0])
>>> plt.show()


Using the preceding code, we obtained the validation curve plot for the parameter C:

[Figure: validation curve showing accuracy (y axis) against Parameter C (x axis, log scale)]


Similar to the learning_curve function, the validation_curve function uses stratified k-fold
cross-validation by default to estimate the performance of the model if we are using algorithms for
classification. Inside the validation_curve function, we specified the parameter that we wanted to
evaluate. In this case, it is C, the inverse regularization parameter of the LogisticRegression
classifier, which we wrote as 'clf__C' to access the LogisticRegression object inside the
scikit-learn pipeline for a specified value range that we set via the param_range parameter. Similar to the
learning curve example in the previous section, we plotted the average training and cross-validation
accuracies and the corresponding standard deviations.


Although the differences in the accuracy for varying values of C are subtle, we can see that the model
slightly underfits the data when we increase the regularization strength (small values of C). Large
values of C, in contrast, lower the strength of regularization, so the model tends to slightly
overfit the data. In this case, the sweet spot appears to be around C=0.1.
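
The same comparison can be made without a plot by looking at the mean scores per value of C. This sketch assumes scikit-learn 0.18 or later (validation_curve imported from sklearn.model_selection) and synthetic data; the 'clf__C' name follows the step identifier chosen for this illustrative pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the chapter's dataset).
X_vc, y_vc = make_classification(n_samples=300, n_features=20, random_state=0)
pipe_vc = Pipeline([('scl', StandardScaler()),
                    ('clf', LogisticRegression(max_iter=1000, random_state=1))])

c_values = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
tr_scores_vc, val_scores_vc = validation_curve(
    estimator=pipe_vc, X=X_vc, y=y_vc,
    param_name='clf__C',          # <step name>__<parameter name>
    param_range=c_values, cv=5)

val_mean_vc = val_scores_vc.mean(axis=1)
for c, tr, va in zip(c_values, tr_scores_vc.mean(axis=1), val_mean_vc):
    print('C=%-7g train=%.3f  validation=%.3f' % (c, tr, va))

best_c = c_values[int(np.argmax(val_mean_vc))]
print('Best C by mean validation accuracy:', best_c)
```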
Fine-tuning machine learning models via grid
search


In machine learning, we have two types of parameters: those that are learned from the training data, 
for example, the weights in logistic regression, and the parameters of a learning algorithm that are
optimized separately. The latter are the tuning parameters, also called hyperparameters, of a model, 
for example, the regularization parameter in logistic regression or the depth parameter of a decision 
tree. 


In the previous section, we used validation curves to improve the performance of a model by tuning 
one of its hyperparameters. In this section, we will take a look at a powerful hyperparameter 
optimization technique called grid search that can further help to improve the performance of a model 
by finding the optimal combination of hyperparameter values. 
Tuning hyperparameters via grid search


The approach of grid search is quite simple: it's a brute-force exhaustive search paradigm where we
specify a list of values for different hyperparameters, and the computer evaluates the model
performance for each combination of those to obtain the optimal set:


>>> from sklearn.grid_search import GridSearchCV
>>> from sklearn.svm import SVC
>>> pipe_svc = Pipeline([('scl', StandardScaler()),
...                      ('clf', SVC(random_state=1))])
>>> param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
>>> param_grid = [{'clf__C': param_range,
...                'clf__kernel': ['linear']},
...               {'clf__C': param_range,
...                'clf__gamma': param_range,
...                'clf__kernel': ['rbf']}]
>>> gs = GridSearchCV(estimator=pipe_svc,
...                   param_grid=param_grid,
...                   scoring='accuracy',
...                   cv=10,
...                   n_jobs=-1)
>>> gs = gs.fit(X_train, y_train)
>>> print(gs.best_score_)
0.978021978022
>>> print(gs.best_params_)
{'clf__C': 0.1, 'clf__kernel': 'linear'}


Using the preceding code, we initialized a GridSearchCV object from the sklearn.grid_search
module to train and tune a support vector machine (SVM) pipeline. We set the param_grid
parameter of GridSearchCV to a list of dictionaries to specify the parameters that we'd want to tune.
For the linear SVM, we only evaluated the inverse regularization parameter C; for the RBF kernel
SVM, we tuned both the C and gamma parameters. Note that the gamma parameter is specific to kernel
SVMs. After we used the training data to perform the grid search, we obtained the score of the
best-performing model via the best_score_ attribute and looked at its parameters, which can be accessed
via the best_params_ attribute. In this particular case, the linear SVM model with 'clf__C' = 0.1
yielded the best k-fold cross-validation accuracy: 97.8 percent.


Finally, we will use the independent test dataset to estimate the performance of the best selected
model, which is available via the best_estimator_ attribute of the GridSearchCV object:


>>> clf = gs.best_estimator_
>>> clf.fit(X_train, y_train)
>>> print('Test accuracy: %.3f' % clf.score(X_test, y_test))
Test accuracy: 0.965
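
A compact version of this workflow for scikit-learn 0.18 and later, where GridSearchCV is imported from sklearn.model_selection, might look as follows. The data is synthetic and the grid is deliberately smaller than the one above to keep the run short, so the selected parameters are not comparable to the chapter's result:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption, not the chapter's dataset).
X_gs, y_gs = make_classification(n_samples=300, n_features=10, random_state=1)
X_gs_tr, X_gs_te, y_gs_tr, y_gs_te = train_test_split(
    X_gs, y_gs, test_size=0.20, random_state=1)

pipe_gs = Pipeline([('scl', StandardScaler()),
                    ('clf', SVC(random_state=1))])
small_grid = [{'clf__C': [0.1, 1.0, 10.0],
               'clf__kernel': ['linear']},
              {'clf__C': [0.1, 1.0, 10.0],
               'clf__gamma': [0.01, 0.1],
               'clf__kernel': ['rbf']}]

search = GridSearchCV(estimator=pipe_gs, param_grid=small_grid,
                      scoring='accuracy', cv=5, n_jobs=1)
search.fit(X_gs_tr, y_gs_tr)

print(search.best_params_)
# With the default refit=True, the best model is already refit on the whole training set.
test_acc = search.score(X_gs_te, y_gs_te)
print('Test accuracy: %.3f' % test_acc)
```

The first dictionary contributes three candidates and the second contributes six, so nine parameter combinations are evaluated in total.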


Note 


Although grid search is a powerful approach for finding the optimal set of parameters, the evaluation
of all possible parameter combinations is also computationally very expensive. An alternative
approach to sampling different parameter combinations using scikit-learn is randomized search. Using
the RandomizedSearchCV class in scikit-learn, we can draw random parameter combinations from
sampling distributions with a specified budget. More details and examples for its usage can be found
Algorithm selection with nested cross-validation


Using k-fold cross-validation in combination with grid search is a useful approach for fine-tuning the
performance of a machine learning model by varying its hyperparameter values, as we saw in the
previous subsection. If we want to select among different machine learning algorithms though, another
recommended approach is nested cross-validation, and in a nice study on the bias in error estimation,
Varma and Simon concluded that the true error of the estimate is almost unbiased relative to the test
set when nested cross-validation is used (S. Varma and R. Simon. Bias in Error Estimation When
Using Cross-validation for Model Selection. BMC Bioinformatics, 7(1):91, 2006).


In nested cross-validation, we have an outer k-fold cross-validation loop to split the data into training 
and test folds, and an inner loop is used to select the model using k-fold cross-validation on the 
training fold. After model selection, the test fold is then used to evaluate the model performance. The 
following figure explains the concept of nested cross-validation with five outer and two inner folds, 
which can be useful for large datasets where computational performance is important; this particular
type of nested cross-validation is also known as 5x2 cross-validation: 








[Figure: nested cross-validation. Outer loop: training folds and test fold (train with optimal
parameters). Inner loop: training fold and validation fold (tune parameters).]





In scikit-learn, we can perform nested cross-validation as follows: 


>>> gs = GridSearchCV(estimator=pipe_svc,
...                   param_grid=param_grid,
...                   scoring='accuracy',
...                   cv=10,
...                   n_jobs=-1)
>>> scores = cross_val_score(gs, X, y, scoring='accuracy', cv=5)
>>> print('CV accuracy: %.3f +/- %.3f' % (
...       np.mean(scores), np.std(scores)))
CV accuracy: 0.978 +/- 0.012

The returned average cross-validation accuracy gives us a good estimate of what to expect if we tune
the hyperparameters of a model and then use it on unseen data. For example, we can use the nested 
cross-validation approach to compare an SVM model to a simple decision tree classifier; for 
simplicity, we will only tune its depth parameter: 


>>> from sklearn.tree import DecisionTreeClassifier
>>> gs = GridSearchCV(
...     estimator=DecisionTreeClassifier(random_state=0),
...     param_grid=[
...         {'max_depth': [1, 2, 3, 4, 5, 6, 7, None]}],
...     scoring='accuracy',
...     cv=5)
>>> scores = cross_val_score(gs,
...                          X_train,
...                          y_train,
...                          scoring='accuracy',
...                          cv=5)
>>> print('CV accuracy: %.3f +/- %.3f' % (
...       np.mean(scores), np.std(scores)))
CV accuracy: 0.908 +/- 0.045


As we can see here, the nested cross-validation performance of the SVM model (97.8 percent) is
notably better than the performance of the decision tree (90.8 percent). Thus, we'd expect that it might 
be the better choice for classifying new data that comes from the same population as this particular 
dataset. 
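The comparison above can be put into a single self-contained sketch. It assumes a recent scikit-learn release (sklearn.model_selection) and uses the bundled load_breast_cancer loader as a stand-in for the data prepared earlier; the candidates dictionary, the parameter grids, and the 5x2 fold counts are choices made for this illustration, so the printed numbers will not match the listings above exactly:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Inner loop: each candidate tunes its own hyperparameters with 2-fold CV.
candidates = {
    'SVM': GridSearchCV(
        estimator=make_pipeline(StandardScaler(), SVC(kernel='linear')),
        param_grid={'svc__C': [0.01, 0.1, 1.0, 10.0]},
        scoring='accuracy', cv=2),
    'Decision tree': GridSearchCV(
        estimator=DecisionTreeClassifier(random_state=0),
        param_grid={'max_depth': [1, 2, 3, 4, 5, 6, 7, None]},
        scoring='accuracy', cv=2),
}

results = {}
for name, gs in candidates.items():
    # Outer loop: 5 folds, each evaluating a freshly tuned model.
    scores = cross_val_score(gs, X, y, scoring='accuracy', cv=5)
    results[name] = (np.mean(scores), np.std(scores))
    print('%s CV accuracy: %.3f +/- %.3f' % (name, *results[name]))
```

Because the grid search object is itself passed to cross_val_score, the hyperparameters are re-tuned inside every outer training fold, and the outer test fold never influences model selection.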

Looking at different performance evaluation metrics


In the previous sections and chapters, we evaluated our models using the model accuracy, which is a 
useful metric to quantify the performance of a model in general. However, there are several other 
performance metrics that can be used to measure a model's relevance, such as precision, recall, and 
the F1-score.

Reading a confusion matrix


Before we get into the details of different scoring metrics, let's print a so-called confusion matrix, a 
matrix that lays out the performance of a learning algorithm. The confusion matrix is simply a square
matrix that reports the counts of the true positive, true negative, false positive, and false negative 
predictions of a classifier, as shown in the following figure: 


                      Predicted class
                      P                       N
Actual class   P      True positives (TP)     False negatives (FN)
               N      False positives (FP)    True negatives (TN)





Although these metrics can be easily computed manually by comparing the true and predicted class 
labels, scikit-learn provides a convenient confusion_matrix function that we can use as follows:


>>> from sklearn.metrics import confusion_matrix
>>> pipe_svc.fit(X_train, y_train)
>>> y_pred = pipe_svc.predict(X_test)
>>> confmat = confusion_matrix(y_true=y_test, y_pred=y_pred)
>>> print(confmat)
[[71  1]
 [ 2 40]]


The array that was returned after executing the preceding code provides us with information about the 
different types of errors the classifier made on the test dataset that we can map onto the confusion 
matrix illustration in the previous figure using matplotlib's matshow function:


>>> fig, ax = plt.subplots(figsize=(2.5, 2.5))
>>> ax.matshow(confmat, cmap=plt.cm.Blues, alpha=0.3)
>>> for i in range(confmat.shape[0]):
...     for j in range(confmat.shape[1]):
...         ax.text(x=j, y=i,
...                 s=confmat[i, j],
...                 va='center', ha='center')
>>> plt.xlabel('predicted label')
>>> plt.ylabel('true label')
>>> plt.show()


Now, the confusion matrix plot as shown here should make the results a little bit easier to interpret: 


[Figure: confusion matrix plot with the true label on the y-axis and the predicted label on the x-axis.]





Assuming that class 1 (malignant) is the positive class in this example, our model correctly classified
71 of the samples that belong to class 0 (true negatives) and 40 samples that belong to class 1 (true
positives). However, our model also misclassified 1 sample from class 0 as class 1 (false positive),
and it predicted that 2 samples are benign although they are malignant tumors (false negatives). In
the next section, we will learn how we can use this information to calculate various error metrics.

Optimizing the precision and recall of a classification model


Both the prediction error (ERR) and accuracy (ACC) provide general information about how many
samples are misclassified. The error can be understood as the sum of all false predictions divided by
the total number of predictions, and the accuracy is calculated as the sum of correct predictions
divided by the total number of predictions, respectively:


$$ERR = \frac{FP + FN}{FP + FN + TP + TN}$$


The prediction accuracy can then be calculated directly from the error: 

$$ACC = \frac{TP + TN}{FP + FN + TP + TN} = 1 - ERR$$


The true positive rate (TPR) and false positive rate (FPR) are performance metrics that are 
especially useful for imbalanced class problems: 


$$FPR = \frac{FP}{N} = \frac{FP}{FP + TN}$$

$$TPR = \frac{TP}{P} = \frac{TP}{FN + TP}$$


In tumor diagnosis, for example, we are more concerned about the detection of malignant tumors in
order to help a patient with the appropriate treatment. However, it is also important to decrease the
number of benign tumors that were incorrectly classified as malignant (false positives) to not
unnecessarily concern a patient. In contrast to the FPR, the true positive rate provides useful
information about the fraction of positive (or relevant) samples that were correctly identified out of
the total pool of positives (P).

Precision (PRE) and recall (REC) are performance metrics that are related to those true positive and
true negative rates, and in fact, recall is synonymous with the true positive rate:

$$PRE = \frac{TP}{TP + FP}$$

$$REC = TPR = \frac{TP}{P} = \frac{TP}{FN + TP}$$


In practice, often a combination of precision and recall is used, the so-called F1-score:

$$F1 = 2 \, \frac{PRE \times REC}{PRE + REC}$$
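As a quick check of these definitions, the following sketch evaluates each formula by hand. The four counts are read off the confusion matrix [[71, 1], [2, 40]] printed earlier, assuming scikit-learn's layout (rows are true classes, columns are predicted classes, class 1 is positive); the variable names are chosen for this illustration:

```python
# Counts from the confusion matrix [[71, 1], [2, 40]]
# (rows: true class 0/1, columns: predicted class 0/1).
tn, fp, fn, tp = 71, 1, 2, 40

total = tp + tn + fp + fn
err = (fp + fn) / total             # prediction error
acc = (tp + tn) / total             # accuracy, equal to 1 - ERR
fpr = fp / (fp + tn)                # false positive rate
tpr = tp / (fn + tp)                # true positive rate
pre = tp / (tp + fp)                # precision
rec = tpr                           # recall is the true positive rate
f1 = 2 * pre * rec / (pre + rec)    # F1-score

print('ERR: %.3f  ACC: %.3f' % (err, acc))
print('FPR: %.3f  TPR: %.3f' % (fpr, tpr))
print('PRE: %.3f  REC: %.3f  F1: %.3f' % (pre, rec, f1))
```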


These scoring metrics are all implemented in scikit-learn and can be imported from the 
sklearn.metrics module, as shown in the following snippet: 


>>> from sklearn.metrics import precision_score
>>> from sklearn.metrics import recall_score, f1_score
>>> print('Precision: %.3f' % precision_score(
...       y_true=y_test, y_pred=y_pred))
Precision: 0.976
>>> print('Recall: %.3f' % recall_score(
...       y_true=y_test, y_pred=y_pred))
Recall: 0.952
>>> print('F1: %.3f' % f1_score(
...       y_true=y_test, y_pred=y_pred))
F1: 0.964


Furthermore, we can use a scoring metric other than accuracy in GridSearchCV via the scoring
parameter. A complete list of the different values that are accepted by the scoring parameter can be
found at http://scikit-learn.org/stable/modules/model_evaluation.html.


Remember that the positive class in scikit-learn is the class that is labeled as class 1. If we want to
specify a different positive label, we can construct our own scorer via the make_scorer function,
which we can then directly provide as an argument to the scoring parameter in GridSearchCV:

>>> from sklearn.metrics import make_scorer, f1_score
>>> scorer = make_scorer(f1_score, pos_label=0)
>>> gs = GridSearchCV(estimator=pipe_svc,
...                   param_grid=param_grid,
...                   scoring=scorer,
...                   cv=10)
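
To see what the pos_label argument changes, the following sketch scores the same predictions twice. The label arrays are toy data constructed for this illustration so that they reproduce the confusion matrix [[71, 1], [2, 40]] from earlier; they are not the chapter's y_test and y_pred:

```python
import numpy as np
from sklearn.metrics import f1_score, make_scorer

# Toy labels: 72 true class-0 samples (71 predicted 0, 1 predicted 1)
# followed by 42 true class-1 samples (2 predicted 0, 40 predicted 1).
y_true = np.array([0] * 72 + [1] * 42)
y_pred = np.array([0] * 71 + [1] * 1 + [0] * 2 + [1] * 40)

f1_pos1 = f1_score(y_true, y_pred, pos_label=1)  # class 1 is "positive"
f1_pos0 = f1_score(y_true, y_pred, pos_label=0)  # class 0 is "positive"
print('F1 (pos_label=1): %.3f' % f1_pos1)
print('F1 (pos_label=0): %.3f' % f1_pos0)

# The same choice wrapped as a scorer object for model selection.
scorer = make_scorer(f1_score, pos_label=0)
```

Switching the positive label swaps the roles of the two classes in the precision and recall counts, so the two F1 values differ even though the predictions are identical.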

Plotting a receiver operating characteristic


Receiver operating characteristic (ROC) graphs are useful tools for selecting models for
classification based on their performance with respect to the false positive and true positive rates,
which are computed by shifting the decision threshold of the classifier. The diagonal of an ROC graph
can be interpreted as random guessing, and classification models that fall below the diagonal are
considered worse than random guessing. A perfect classifier would fall into the top-left corner of
the graph with a true positive rate of 1 and a false positive rate of 0. Based on the ROC curve, we can
then compute the so-called area under the curve (AUC) to characterize the performance of a
classification model.
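
A minimal sketch of how shifting the decision threshold produces the points of an ROC curve is shown next. The four labels and scores are made-up toy values for this illustration; roc_curve and auc are the scikit-learn functions from sklearn.metrics:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Toy data: true labels and the classifier's scores for the positive class.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Each threshold yields one (FPR, TPR) point on the curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label=1)
roc_auc = auc(fpr, tpr)

for f, t, th in zip(fpr, tpr, thresholds):
    print('threshold %.2f -> FPR %.2f, TPR %.2f' % (th, f, t))
print('AUC: %.2f' % roc_auc)
```

Here three of the four positive-negative score pairs are ranked correctly, which is why the area under the curve comes out at 0.75.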


Note 


Similar to ROC curves, we can compute precision-recall curves for the different probability
thresholds of a classifier. A function for plotting those precision-recall curves is also implemented in
scikit-learn and is documented at
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html.





By executing the following code example, we will plot an ROC curve of a classifier that only uses 
two features from the Breast Cancer Wisconsin dataset to predict whether a tumor is benign or 
malignant. Although we are going to use the same logistic regression pipeline that we defined 
previously, we are making the classification task more challenging for the classifier so that the 
resulting ROC curve becomes visually more interesting. For similar reasons, we are also reducing the 
number of folds in the StratifiedKFold validator to three. The code is as follows:


>>> from sklearn.metrics import roc_curve, auc
>>> from scipy import interp
>>> X_train2 = X_train[:, [4, 14]]
>>> cv = StratifiedKFold(y_train,
...                      n_folds=3,
...                      random_state=1)
>>> fig = plt.figure(figsize=(7, 5))
>>> mean_tpr = 0.0
>>> mean_fpr = np.linspace(0, 1, 100)
>>> all_tpr = []
>>> for i, (train, test) in enumerate(cv):
...     probas = pipe_lr.fit(X_train2[train],
...                  y_train[train]).predict_proba(X_train2[test])
...     fpr, tpr, thresholds = roc_curve(y_train[test],
...                                      probas[:, 1],
...                                      pos_label=1)
...     mean_tpr += interp(mean_fpr, fpr, tpr)
...     mean_tpr[0] = 0.0
...     roc_auc = auc(fpr, tpr)
...     plt.plot(fpr,
...              tpr,
...              lw=1,
...              label='ROC fold %d (area = %0.2f)'
...                    % (i+1, roc_auc))
>>> plt.plot([0, 1],
...          [0, 1],
...          linestyle='--',
...          color=(0.6, 0.6, 0.6),
...          label='random guessing')
>>> mean_tpr /= len(cv)
>>> mean_tpr[-1] = 1.0
>>> mean_auc = auc(mean_fpr, mean_tpr)
>>> plt.plot(mean_fpr, mean_tpr, 'k--',
...          label='mean ROC (area = %0.2f)' % mean_auc, lw=2)
>>> plt.plot([0, 0, 1],
...          [0, 1, 1],
...          lw=2,
...          linestyle=':',
...          color='black',
...          label='perfect performance')
>>> plt.xlim([-0.05, 1.05])
>>> plt.ylim([-0.05, 1.05])
>>> plt.xlabel('false positive rate')
>>> plt.ylabel('true positive rate')
>>> plt.title('Receiver Operator Characteristic')
>>> plt.legend(loc="lower right")
>>> plt.show()


In the preceding code example, we used the already familiar StratifiedKFold class from scikit-learn
and calculated the ROC performance of the LogisticRegression classifier in our pipe_lr
pipeline using the roc_curve function from the sklearn.metrics module separately for each
iteration. Furthermore, we interpolated the average ROC curve from the three folds via the interp
function that we imported from SciPy and calculated the area under the curve via the auc function.
The resulting ROC curve indicates that there is a certain degree of variance between the different
folds, and the average ROC AUC (0.75) falls between a perfect score (1.0) and random guessing
(0.5):

[Figure: Receiver Operator Characteristic. x-axis: false positive rate; y-axis: true positive rate.
Legend: ROC fold 1 (area = 0.69), ROC fold 2 (area = 0.78), ROC fold 3 (area = 0.76), random guessing,
mean ROC (area = 0.75), perfect performance.]


If we are just interested in the ROC AUC score, we could also directly import the roc_auc_score
function from the sklearn.metrics submodule. The following code calculates the classifier's ROC 
AUC score on the independent test dataset after fitting it on the two-feature training set: 


>>> pipe_svc = pipe_svc.fit(X_train2, y_train)
>>> y_pred2 = pipe_svc.predict(X_test[:, [4, 14]])
>>> from sklearn.metrics import roc_auc_score
>>> from sklearn.metrics import accuracy_score
>>> print('ROC AUC: %.3f' % roc_auc_score(
...       y_true=y_test, y_score=y_pred2))
ROC AUC: 0.671
>>> print('Accuracy: %.3f' % accuracy_score(
...       y_true=y_test, y_pred=y_pred2))
Accuracy: 0.728


Reporting the performance of a classifier as the ROC AUC can yield further insights into a classifier's
performance with respect to imbalanced samples. However, while the accuracy score can be
interpreted as a single cut-off point on a ROC curve, A. P. Bradley showed that the ROC AUC and
accuracy metrics mostly agree with each other (A. P. Bradley. The Use of the Area Under the ROC
Curve in the Evaluation of Machine Learning Algorithms. Pattern Recognition, 30(7):1145-1159,
1997).

The scoring metrics for multiclass classification


The scoring metrics that we discussed in this section are specific to binary classification systems. 
However, scikit-learn also implements macro and micro averaging methods to extend those scoring 
metrics to multiclass problems via One vs. All (OvA) classification. The micro-average is
calculated from the individual true positives, true negatives, false positives, and false negatives of the 
system. For example, the micro-average of the precision score in a k-class system can be calculated 
as follows: 


$$PRE_{micro} = \frac{TP_1 + \cdots + TP_k}{TP_1 + \cdots + TP_k + FP_1 + \cdots + FP_k}$$


The macro-average is simply calculated as the average score of the different systems:

$$PRE_{macro} = \frac{PRE_1 + \cdots + PRE_k}{k}$$


Micro-averaging is useful if we want to weight each instance or prediction equally, whereas
macro-averaging weights all classes equally to evaluate the overall performance of a classifier with
regard to the most frequent class labels.
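
The difference between the two averages can be made concrete with a small worked example. The label arrays below are toy data invented for this sketch, and the per-class counts are computed directly with NumPy rather than through scikit-learn:

```python
import numpy as np

# Toy three-class problem (labels 0, 1, 2); class 0 is the most frequent.
y_true = np.array([0, 0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 0, 1, 1, 2, 2])
labels = np.unique(y_true)

# One-vs-all counts per class: predicted k and correct / predicted k and wrong.
tp = np.array([np.sum((y_pred == k) & (y_true == k)) for k in labels])
fp = np.array([np.sum((y_pred == k) & (y_true != k)) for k in labels])

pre_micro = tp.sum() / (tp.sum() + fp.sum())  # pool all counts first
pre_macro = np.mean(tp / (tp + fp))           # average the per-class scores

print('per-class precision:', tp / (tp + fp))
print('micro: %.3f  macro: %.3f' % (pre_micro, pre_macro))
```

The per-class precisions are 1.0, 0.5, and 0.5, so the macro-average is 2/3, while pooling the counts gives 5/7 because the well-predicted majority class contributes more predictions.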


If we are using binary performance metrics to evaluate multiclass classification models in
scikit-learn, a normalized or weighted variant of the macro-average is used by default. The weighted
macro-average is calculated by weighting the score of each class label by the number of true
instances when calculating the average. The weighted macro-average is useful if we are dealing with
class imbalances, that is, different numbers of instances for each label.


While the weighted macro-average is the default for multiclass problems in scikit-learn, we can
specify the averaging method via the average parameter inside the different scoring functions that we
import from the sklearn.metrics module, for example, the precision_score or make_scorer
functions:


>>> pre_scorer = make_scorer(score_func=precision_score,
...                          pos_label=1,
...                          greater_is_better=True,
...                          average='micro')

Summary


In the beginning of this chapter, we discussed how to chain different transformation techniques and 
classifiers in convenient model pipelines that helped us to train and evaluate machine learning models 
more efficiently. We then used those pipelines to perform k-fold cross-validation, one of the essential 
techniques for model selection and evaluation. Using k-fold cross-validation, we plotted learning and 
validation curves to diagnose the common problems of learning algorithms, such as overfitting and 
underfitting. Using grid search, we further fine-tuned our model. We concluded this chapter by looking 
at a confusion matrix and various different performance metrics that can be useful to further optimize 
a model's performance for a specific problem task. Now, we should be well-equipped with the 
essential techniques to build supervised machine learning models for classification successfully. 


In the next chapter, we will take a look at ensemble methods, methods that allow us to combine 
multiple models and classification algorithms to boost the predictive performance of a machine 
learning system even further. 




Chapter 7. Combining Different Models for Ensemble Learning


In the previous chapter, we focused on the best practices for tuning and evaluating different models 
for classification. In this chapter, we will build upon these techniques and explore different methods 
for constructing a set of classifiers that can often have a better predictive performance than any of its 
individual members. You will learn how to: 


• Make predictions based on majority voting
• Reduce overfitting by drawing random combinations of the training set with repetition
• Build powerful models from weak learners that learn from their mistakes

Learning with ensembles


The goal behind ensemble methods is to combine different classifiers into a meta-classifier that has 
a better generalization performance than each individual classifier alone. For example, assuming that 
we collected predictions from 10 experts, ensemble methods would allow us to strategically combine 
these predictions by the 10 experts to come up with a prediction that is more accurate and robust than
the predictions by each individual expert. As we will see later in this chapter, there are several 
different approaches for creating an ensemble of classifiers. In this section, we will introduce a basic 
perception about how ensembles work and why they are typically recognized for yielding a good 
generalization performance. 


In this chapter, we will focus on the most popular ensemble methods that use the majority voting 
principle. Majority voting simply means that we select the class label that has been predicted by the 
majority of classifiers, that is, received more than 50 percent of the votes. Strictly speaking, the term
majority vote refers to binary class settings only. However, it is easy to generalize the majority 
voting principle to multi-class settings, which is called plurality voting. Here, we select the class 
label that received the most votes (mode). The following diagram illustrates the concept of majority 
and plurality voting for an ensemble of 10 classifiers where each unique symbol (triangle, square, and 
circle) represents a unique class label: 


[Figure: unanimity, majority, and plurality voting for an ensemble of 10 classifiers.]
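
The distinction between majority and plurality voting can be sketched in a few lines of standard-library Python. The plurality_vote helper and the symbol names are hypothetical, introduced only for this illustration:

```python
from collections import Counter


def plurality_vote(votes):
    """Return the most frequent label and whether it also holds
    a strict majority (more than 50 percent of the votes)."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count > len(votes) / 2


# 6 of 10 votes: the winner has a majority (and therefore a plurality).
majority_case = plurality_vote(['circle'] * 6 + ['triangle'] * 4)
# 4 of 10 votes: the winner has only a plurality.
plurality_case = plurality_vote(
    ['circle'] * 4 + ['triangle'] * 3 + ['square'] * 3)

print(majority_case)
print(plurality_case)
```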





Using the training set, we start by training m different classifiers ($C_1, \dots, C_m$). Depending on the
technique, the ensemble can be built from different classification algorithms, for example, decision 
trees, support vector machines, logistic regression classifiers, and so on. Alternatively, we can also 
use the same base classification algorithm fitting different subsets of the training set. One prominent 
example of this approach would be the random forest algorithm, which combines different decision 
tree classifiers. The following diagram illustrates the concept of a general ensemble approach using 
majority voting: 

[Figure: general ensemble approach using majority voting. The training set is used to fit the
classification models, and their predictions are combined into the final prediction.]





To predict a class label via a simple majority or plurality voting, we combine the predicted class
labels of each individual classifier $C_j$ and select the class label $\hat{y}$ that received the most votes:

$$\hat{y} = \mathrm{mode}\{C_1(x), C_2(x), \dots, C_m(x)\}$$


For example, in a binary classification task where class1 = -1 and class2 = +1, we can write the
majority vote prediction as follows:

$$C(x) = \mathrm{sign}\left[\sum_{j}^{m} C_j(x)\right] = \begin{cases} 1 & \text{if } \sum_{j} C_j(x) \geq 0 \\ -1 & \text{otherwise} \end{cases}$$


To illustrate why ensemble methods can work better than individual classifiers alone, let's apply the
simple concepts of combinatorics. For the following example, we make the assumption that all n base
classifiers for a binary classification task have an equal error rate $\varepsilon$. Furthermore, we assume that
the classifiers are independent and the error rates are not correlated. Under those assumptions, we
can simply express the error probability of an ensemble of base classifiers as a probability mass
function of a binomial distribution:


$$P(y \geq k) = \sum_{k}^{n} \binom{n}{k} \varepsilon^{k} (1 - \varepsilon)^{n-k} = \varepsilon_{ensemble}$$

Here, $\binom{n}{k}$ is the binomial coefficient n choose k. In other words, we compute the probability that the
prediction of the ensemble is wrong. Now let's take a look at a more concrete example of 11 base
classifiers ($n = 11$) with an error rate of 0.25 ($\varepsilon = 0.25$):

$$P(y \geq k) = \sum_{k=6}^{11} \binom{11}{k} 0.25^{k} (1 - 0.25)^{11-k} = 0.034$$


As we can see, the error rate of the ensemble (0.034) is much lower than the error rate of each
individual classifier (0.25) if all the assumptions are met. Note that, in this simplified illustration, a
50-50 split by an even number of classifiers n is treated as an error, whereas this is only true half of
the time. To compare such an idealistic ensemble classifier to a base classifier over a range of 
different base error rates, let's implement the probability mass function in Python: 


>>> from scipy.misc import comb
>>> import math
>>> def ensemble_error(n_classifier, error):
...     k_start = math.ceil(n_classifier / 2.0)
...     probs = [comb(n_classifier, k) *
...              error**k *
...              (1-error)**(n_classifier - k)
...              for k in range(k_start, n_classifier + 1)]
...     return sum(probs)
>>> ensemble_error(n_classifier=11, error=0.25)
0.034327507019042969
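
The closed-form value can also be checked empirically. The following Monte Carlo sketch simulates many ensembles of 11 independent base classifiers that are each wrong with probability 0.25 and counts how often the majority is wrong; the seed, the number of trials, and the variable names are arbitrary choices for this illustration:

```python
import math

import numpy as np

rng = np.random.default_rng(0)
n_classifier, error, n_trials = 11, 0.25, 200_000

# wrong[t, j] is True when base classifier j errs in trial t
# (independent errors with equal probability, as assumed in the text).
wrong = rng.random((n_trials, n_classifier)) < error

# The majority vote is wrong when at least ceil(n / 2) votes are wrong.
k_start = math.ceil(n_classifier / 2)
simulated_error = np.mean(wrong.sum(axis=1) >= k_start)

print('simulated ensemble error: %.4f' % simulated_error)
```

With this many trials the simulated rate should land close to 0.034; larger deviations would indicate that the independence assumption was not reproduced correctly.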


After we've implemented the ensemble error function, we can compute the ensemble error rates for 
a range of different base errors from 0.0 to 1.0 to visualize the relationship between ensemble and 
base errors in a line graph:


>>> import numpy as np
>>> error_range = np.arange(0.0, 1.01, 0.01)
>>> ens_errors = [ensemble_error(n_classifier=11, error=error)
...               for error in error_range]
>>> import matplotlib.pyplot as plt
>>> plt.plot(error_range, ens_errors,
...          label='Ensemble error',
...          linewidth=2)
>>> plt.plot(error_range, error_range,
...          linestyle='--', label='Base error',
...          linewidth=2)
>>> plt.xlabel('Base error')
>>> plt.ylabel('Base/Ensemble error')
>>> plt.legend(loc='upper left')
>>> plt.grid()
>>> plt.show()


As we can see in the resulting plot, the error probability of an ensemble is always better than the
error of an individual base classifier as long as the base classifiers perform better than random
guessing ($\varepsilon < 0.5$). Note that the y-axis depicts the base error (dotted line) as well as the ensemble
error (continuous line):


[Figure: ensemble error and base error plotted against the base error. x-axis: Base error;
y-axis: Base/Ensemble error.]

Implementing a simple majority vote classifier


After the short introduction to ensemble learning in the previous section, let's start with a warm-up 
exercise and implement a simple ensemble classifier for majority voting in Python. Although the 
following algorithm also generalizes to multi-class settings via plurality voting, we will use the term 
majority voting for simplicity as is also often done in literature. 


The algorithm that we are going to implement will allow us to combine different classification 
algorithms associated with individual weights for confidence. Our goal is to build a stronger meta- 
classifier that balances out the individual classifiers' weaknesses on a particular dataset. In more 
precise mathematical terms, we can write the weighted majority vote as follows: 


$$\hat{y} = \arg\max_{i} \sum_{j=1}^{m} w_j \chi_A\left(C_j(x) = i\right)$$

Here, $w_j$ is a weight associated with a base classifier, $C_j$, $\hat{y}$ is the predicted class label of the
ensemble, $\chi_A$ (Greek chi) is the characteristic function $[C_j(x) = i \in A]$, and $A$ is the set of unique
class labels. For equal weights, we can simplify this equation and write it as follows:

$$\hat{y} = \mathrm{mode}\{C_1(x), C_2(x), \dots, C_m(x)\}$$


To better understand the concept of weighting, we will now take a look at a more concrete example. 

Let's assume that we have an ensemble of three base classifiers $C_j$ ($j \in \{1, 2, 3\}$) and want to predict
the class label of a given sample instance $x$. Two out of three base classifiers predict the class label
0, and one, $C_3$, predicts that the sample belongs to class 1. If we weight the predictions of each base
classifier equally, the majority vote will predict that the sample belongs to class 0:

$$C_1(x) \to 0, \quad C_2(x) \to 0, \quad C_3(x) \to 1$$

$$\hat{y} = \mathrm{mode}\{0, 0, 1\} = 0$$

Now let's assign a weight of 0.6 to $C_3$ and weight $C_1$ and $C_2$ by a coefficient of 0.2, respectively.

$$\hat{y} = \arg\max_{i} \sum_{j=1}^{m} w_j \chi_A\left(C_j(x) = i\right) = \arg\max_{i} \left[0.2 \times i_0 + 0.2 \times i_0 + 0.6 \times i_1\right] = 1$$

More intuitively, since $3 \times 0.2 = 0.6$, we can say that the prediction made by $C_3$ has three times more
weight than the predictions by $C_1$ or $C_2$, respectively. We can write this as follows:

$$\hat{y} = \mathrm{mode}\{0, 0, 1, 1, 1\} = 1$$


To translate the concept of the weighted majority vote into Python code, we can use NumPy's 
convenient argmax and bincount functions: 


>>> import numpy as np 

>>> np.argmax(np.bincount([0, 0, 1],
...           weights=[0.2, 0.2, 0.6]))
1


As discussed in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, certain
classifiers in scikit-learn can also return the probability of a predicted class label via the
predict_proba method. Using the predicted class probabilities instead of the class labels for
majority voting can be useful if the classifiers in our ensemble are well calibrated. The modified
version of the majority vote for predicting class labels from probabilities can be written as follows:


$$\hat{y} = \arg\max_{i} \sum_{j=1}^{m} w_j p_{ij}$$

Here, $p_{ij}$ is the predicted probability of the jth classifier for class label i.


To continue with our previous example, let's assume that we have a binary classification problem
with class labels $i \in \{0, 1\}$ and an ensemble of three classifiers $C_j$ ($j \in \{1, 2, 3\}$). Let's assume that the
classifier $C_j$ returns the following class membership probabilities for a particular sample $x$:

$$C_1(x) \to [0.9, 0.1], \quad C_2(x) \to [0.8, 0.2], \quad C_3(x) \to [0.4, 0.6]$$

We can then calculate the individual class probabilities as follows:

$$p(i_0 \mid x) = 0.2 \times 0.9 + 0.2 \times 0.8 + 0.6 \times 0.4 = 0.58$$

$$p(i_1 \mid x) = 0.2 \times 0.1 + 0.2 \times 0.2 + 0.6 \times 0.6 = 0.42$$

$$\hat{y} = \arg\max_{i} \left[p(i_0 \mid x), p(i_1 \mid x)\right] = 0$$


To implement the weighted majority vote based on class probabilities, we can again make use of 
NumPy using np.average and np.argmax:


>>> ex = np.array([[0.9, 0.1],
...                [0.8, 0.2],
...                [0.4, 0.6]])
>>> p = np.average(ex, axis=0, weights=[0.2, 0.2, 0.6])
>>> p
array([ 0.58,  0.42])
>>> np.argmax(p)
0
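
It is worth noting that the two voting schemes can disagree on the same sample. The following sketch reuses the weights and class membership probabilities from the example above and derives each classifier's hard label from its probabilities; the variable names are chosen for this illustration:

```python
import numpy as np

weights = [0.2, 0.2, 0.6]
probas = np.array([[0.9, 0.1],   # C_1
                   [0.8, 0.2],   # C_2
                   [0.4, 0.6]])  # C_3

# Hard labels each classifier would predict on its own: [0, 0, 1].
hard_labels = np.argmax(probas, axis=1)

# Weighted vote on class labels: sums the weights per predicted label.
label_vote = np.argmax(np.bincount(hard_labels, weights=weights))
# Weighted vote on probabilities: averages the probability columns.
proba_vote = np.argmax(np.average(probas, axis=0, weights=weights))

print('class-label vote:', label_vote)
print('probability vote:', proba_vote)
```

The label-based vote follows the heavily weighted $C_3$ and returns class 1, whereas the probability-based vote returns class 0 because $C_3$ is only mildly confident while $C_1$ and $C_2$ are very confident.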


Putting everything together, let's now implement a MajorityVoteClassifier in Python: 


from sklearn.base import BaseEstimator
from sklearn.base import ClassifierMixin
from sklearn.preprocessing import LabelEncoder
from sklearn.externals import six
from sklearn.base import clone
from sklearn.pipeline import _name_estimators
import numpy as np
import operator


class MajorityVoteClassifier(BaseEstimator,
                             ClassifierMixin):
    """ A majority vote ensemble classifier

    Parameters
    ----------
    classifiers : array-like, shape = [n_classifiers]
      Different classifiers for the ensemble

    vote : str, {'classlabel', 'probability'}
      Default: 'classlabel'
      If 'classlabel' the prediction is based on
      the argmax of class labels. Else if
      'probability', the argmax of the sum of
      probabilities is used to predict the class label
      (recommended for calibrated classifiers).

    weights : array-like, shape = [n_classifiers]
      Optional, default: None
      If a list of `int` or `float` values are
      provided, the classifiers are weighted by
      importance; Uses uniform weights if `weights=None`.

    """
    def __init__(self, classifiers,
                 vote='classlabel', weights=None):

        self.classifiers = classifiers
        self.named_classifiers = {key: value for
                                  key, value in
                                  _name_estimators(classifiers)}
        self.vote = vote
        self.weights = weights

    def fit(self, X, y):
        """ Fit classifiers.

        Parameters
        ----------
        X : {array-like, sparse matrix},
            shape = [n_samples, n_features]
            Matrix of training samples.

        y : array-like, shape = [n_samples]
            Vector of target class labels.

        Returns
        -------
        self : object

        """
        # Use LabelEncoder to ensure class labels start
        # with 0, which is important for np.argmax
        # call in self.predict
        self.lablenc_ = LabelEncoder()
        self.lablenc_.fit(y)
        self.classes_ = self.lablenc_.classes_
        self.classifiers_ = []
        for clf in self.classifiers:
            fitted_clf = clone(clf).fit(X,
                              self.lablenc_.transform(y))
            self.classifiers_.append(fitted_clf)
        return self


I added a lot of comments to the code to better understand the individual parts. However, before we
implement the remaining methods, let's take a quick break and discuss some of the code that may look
confusing at first. We used the parent classes BaseEstimator and ClassifierMixin to get some
base functionality for free, including the methods get_params and set_params to return and set the
classifier's parameters as well as the score method to calculate the prediction accuracy, respectively.
Also note that we imported six to make the MajorityVoteClassifier compatible with Python 2.7.


Next we will add the predict method to predict the class label via majority vote based on the class
labels if we initialize a new MajorityVoteClassifier object with vote='classlabel'.
Alternatively, we will be able to initialize the ensemble classifier with vote='probability' to
predict the class label based on the class membership probabilities. Furthermore, we will also add a
predict_proba method to return the average probabilities, which is useful to compute the Receiver
Operator Characteristic area under the curve (ROC AUC).


    def predict(self, X):
        """ Predict class labels for X.

        Parameters
        ----------
        X : {array-like, sparse matrix},
            shape = [n_samples, n_features]
            Matrix of training samples.

        Returns
        ----------
        maj_vote : array-like, shape = [n_samples]
            Predicted class labels.

        """
        if self.vote == 'probability':
            maj_vote = np.argmax(self.predict_proba(X),
                                 axis=1)
        else:  # 'classlabel' vote

            # Collect results from clf.predict calls
            predictions = np.asarray([clf.predict(X)
                                      for clf in
                                      self.classifiers_]).T

            maj_vote = np.apply_along_axis(
                           lambda x:
                           np.argmax(np.bincount(x,
                                        weights=self.weights)),
                           axis=1,
                           arr=predictions)
        maj_vote = self.lablenc_.inverse_transform(maj_vote)
        return maj_vote

    def predict_proba(self, X):
        """ Predict class probabilities for X.

        Parameters
        ----------
        X : {array-like, sparse matrix},
            shape = [n_samples, n_features]
            Training vectors, where n_samples is
            the number of samples and
            n_features is the number of features.

        Returns
        ----------
        avg_proba : array-like,
            shape = [n_samples, n_classes]
            Weighted average probability for
            each class per sample.

        """
        probas = np.asarray([clf.predict_proba(X)
                             for clf in self.classifiers_])
        avg_proba = np.average(probas,
                               axis=0, weights=self.weights)
        return avg_proba

    def get_params(self, deep=True):
        """ Get classifier parameter names for GridSearch"""
        if not deep:
            return super(MajorityVoteClassifier,
                         self).get_params(deep=False)
        else:
            out = self.named_classifiers.copy()
            for name, step in\
                    six.iteritems(self.named_classifiers):
                for key, value in six.iteritems(
                        step.get_params(deep=True)):
                    out['%s__%s' % (name, key)] = value
            return out


Also, note that we defined our own modified version of the get_params method to use the
_name_estimators function in order to access the parameters of individual classifiers in the
ensemble. This may look a little bit complicated at first, but it will make perfect sense when we use
grid search for hyperparameter-tuning in later sections.


Note


Although our MajorityVoteClassifier implementation is very useful for demonstration purposes, I
also implemented a more sophisticated version of the majority vote classifier in scikit-learn. It will
become available as sklearn.ensemble.VotingClassifier in the next release version (v0.17).
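

The two voting modes can be illustrated without the class itself. The following is a minimal sketch in plain Python, not the MajorityVoteClassifier implementation: the helper names weighted_label_vote and weighted_proba_vote and the three classifier outputs are assumptions made up for the illustration. It shows that, for the same weights, voting on class labels and voting on averaged probabilities may pick different classes.

```python
from collections import defaultdict


def weighted_label_vote(labels, weights):
    """Return the class label whose votes carry the largest total weight."""
    totals = defaultdict(float)
    for label, weight in zip(labels, weights):
        totals[label] += weight
    # Highest total wins; ties go to the smallest label, as with np.argmax.
    return min(totals, key=lambda label: (-totals[label], label))


def weighted_proba_vote(probas, weights):
    """Return (weighted average probabilities, index of the winning class)."""
    n_classes = len(probas[0])
    total_weight = sum(weights)
    avg = [sum(w * p[k] for p, w in zip(probas, weights)) / total_weight
           for k in range(n_classes)]
    winner = max(range(n_classes), key=lambda k: avg[k])
    return avg, winner


# Hypothetical outputs of three classifiers for a single sample.
weights = [0.2, 0.2, 0.6]
label_winner = weighted_label_vote([0, 0, 1], weights)
avg_proba, proba_winner = weighted_proba_vote(
    [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6]], weights)
print(label_winner, avg_proba, proba_winner)
```

In this made-up case the heavily weighted third classifier decides the label vote, while the averaged probabilities favor the other class because the first two classifiers are far more confident.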




Combining different algorithms for classification with 
majority vote 


Now it is about time to put the MajorityVoteClassifier that we implemented in the previous
section into action. But first, let's prepare a dataset that we can test it on. Since we are already
familiar with techniques to load datasets from CSV files, we will take a shortcut and load the Iris
dataset from scikit-learn's dataset module. Furthermore, we will only select two features, sepal width
and petal length, to make the classification task more challenging. Although our
MajorityVoteClassifier generalizes to multiclass problems, we will only classify flower samples
from the two classes, Iris-Versicolor and Iris-Virginica, to compute the ROC AUC. The code is as
follows:


>>> from sklearn import datasets
>>> # scikit-learn >= 0.18 provides train_test_split in sklearn.model_selection
>>> from sklearn.cross_validation import train_test_split
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.preprocessing import LabelEncoder
>>> iris = datasets.load_iris()
>>> X, y = iris.data[50:, [1, 2]], iris.target[50:]
>>> le = LabelEncoder()
>>> y = le.fit_transform(y)


Note 


Note that scikit-learn uses the predict_proba method (if applicable) to compute the ROC AUC
score. In Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, we saw how the 
class probabilities are computed in logistic regression models. In decision trees, the probabilities are 
calculated from a frequency vector that is created for each node at training time. The vector collects 
the frequency values of each class label computed from the class label distribution at that node. Then 
the frequencies are normalized so that they sum up to 1. Similarly, the class labels of the k-nearest 
neighbors are aggregated to return the normalized class label frequencies in the k-nearest neighbors 
algorithm. Although the normalized probabilities returned by both the decision tree and k-nearest 
neighbors classifier may look similar to the probabilities obtained from a logistic regression model, 
we have to be aware that these are actually not derived from probability mass functions. 
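

As a rough illustration of the normalized class label frequencies mentioned above, the following sketch turns the labels of a sample's nearest neighbors into values that sum to 1. The helper name neighbor_class_frequencies and the five neighbor labels are hypothetical; the sketch assumes uniformly weighted neighbors and is not scikit-learn's implementation.

```python
from collections import Counter


def neighbor_class_frequencies(neighbor_labels, classes):
    """Normalize the label counts among the neighbors so they sum to 1."""
    counts = Counter(neighbor_labels)
    n_neighbors = len(neighbor_labels)
    return [counts[c] / n_neighbors for c in classes]


# Hypothetical labels of the 5 nearest neighbors of one query sample.
freqs = neighbor_class_frequencies([1, 1, 0, 1, 1], classes=[0, 1])
print(freqs)
```

The same normalization applies to the class label counts collected at a decision tree node.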


Next we split the Iris samples into 50 percent training and 50 percent test data: 


>>> X_train, X_test, y_train, y_test =\
...        train_test_split(X, y,
...                         test_size=0.5,
...                         random_state=1)


Using the training dataset, we now will train three different classifiers—a logistic regression 
classifier, a decision tree classifier, and a k-nearest neighbors classifier—and look at their individual 
performances via a 10-fold cross-validation on the training dataset before we combine them into an 
ensemble classifier: 


>>> # scikit-learn >= 0.18 provides cross_val_score in sklearn.model_selection
>>> from sklearn.cross_validation import cross_val_score
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.tree import DecisionTreeClassifier
>>> from sklearn.neighbors import KNeighborsClassifier
>>> from sklearn.pipeline import Pipeline
>>> import numpy as np
>>> clf1 = LogisticRegression(penalty='l2',
...                           C=0.001,
...                           random_state=0)
>>> clf2 = DecisionTreeClassifier(max_depth=1,
...                               criterion='entropy',
...                               random_state=0)
>>> clf3 = KNeighborsClassifier(n_neighbors=1,
...                             p=2,
...                             metric='minkowski')
>>> pipe1 = Pipeline([['sc', StandardScaler()],
...                   ['clf', clf1]])
>>> pipe3 = Pipeline([['sc', StandardScaler()],
...                   ['clf', clf3]])
>>> clf_labels = ['Logistic Regression', 'Decision Tree', 'KNN']
>>> print('10-fold cross validation:\n')
>>> for clf, label in zip([pipe1, clf2, pipe3], clf_labels):
...     scores = cross_val_score(estimator=clf,
...                              X=X_train,
...                              y=y_train,
...                              cv=10,
...                              scoring='roc_auc')
...     print("ROC AUC: %0.2f (+/- %0.2f) [%s]"
...                % (scores.mean(), scores.std(), label))


The output that we receive, as shown in the following snippet, shows that the predictive performances
of the individual classifiers are almost equal:


10-fold cross validation: 


ROC AUC: 0.92 (+/- 0.20) [Logistic Regression] 
ROC AUC: 0.92 (+/- 0.15) [Decision Tree] 
ROC AUC: 0.93 (+/- 0.10) [KNN] 
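

To make the 10-fold procedure more tangible, here is a minimal sketch of how sample indices can be split into consecutive folds. The helper name k_fold_indices is an assumption for the illustration. Unlike scikit-learn, which stratifies the folds by class label for classifiers when cv is an integer, this sketch ignores the labels.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k consecutive folds."""
    indices = list(range(n_samples))
    # The first n_samples % k folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# 50 training samples, as in the 50/50 split of the two Iris classes.
folds = list(k_fold_indices(n_samples=50, k=10))
for train, test in folds[:2]:
    print(len(train), len(test), test)
```

Each sample is used for testing exactly once, and the reported score is the mean over the 10 held-out folds.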


You may be wondering why we trained the logistic regression and k-nearest neighbors classifier as
part of a pipeline. The reason behind it is that, as discussed in Chapter 3, A Tour of Machine
Learning Classifiers Using Scikit-learn, both the logistic regression and k-nearest neighbors
algorithms (using the Euclidean distance metric) are not scale-invariant in contrast with decision
trees. Although the Iris features are all measured on the same scale (cm), it is a good habit to work
with standardized features.
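

The effect of feature scale on a Euclidean nearest-neighbor search can be sketched with a small made-up dataset. The helper names and the four training points below are hypothetical; the scaling uses the population standard deviation, which is the convention StandardScaler follows. In this example the nearest training point changes once both features are put on the same scale.

```python
import math


def fit_scaler(X):
    """Return per-feature means and population standard deviations."""
    n = len(X)
    columns = list(zip(*X))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n)
            for col, m in zip(columns, means)]
    return means, stds


def transform(X, means, stds):
    """Standardize every row with the fitted means and deviations."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


def nearest_index(X, query):
    """Index of the row in X closest to query in Euclidean distance."""
    def dist(row):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(row, query)))
    return min(range(len(X)), key=lambda i: dist(X[i]))


# Hypothetical data: the second feature has a much larger range.
X_demo = [[1.0, 200.0], [3.0, 210.0], [2.0, 100.0], [2.0, 300.0]]
query = [1.2, 209.0]

raw_nn = nearest_index(X_demo, query)
means, stds = fit_scaler(X_demo)
std_nn = nearest_index(transform(X_demo, means, stds),
                       transform([query], means, stds)[0])
print(raw_nn, std_nn)
```

On the raw data the large-range feature dominates the distance; after standardization both features contribute comparably.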


Now let's move on to the more exciting part and combine the individual classifiers for majority rule
voting in our MajorityVoteClassifier:


>>> mv_clf = MajorityVoteClassifier(
...                 classifiers=[pipe1, clf2, pipe3])
>>> clf_labels += ['Majority Voting']
>>> all_clf = [pipe1, clf2, pipe3, mv_clf]
>>> for clf, label in zip(all_clf, clf_labels):
...     scores = cross_val_score(estimator=clf,
...                              X=X_train,
...                              y=y_train,
...                              cv=10,
...                              scoring='roc_auc')
...     print("ROC AUC: %0.2f (+/- %0.2f) [%s]"
...                % (scores.mean(), scores.std(), label))
ROC AUC: 0.92 (+/- 0.20) [Logistic Regression]
ROC AUC: 0.92 (+/- 0.15) [Decision Tree]
ROC AUC: 0.93 (+/- 0.10) [KNN]
ROC AUC: 0.97 (+/- 0.10) [Majority Voting]


As we can see, the performance of the MajorityVoteClassifier has substantially improved over
the individual classifiers in the 10-fold cross-validation evaluation.


Evaluating and tuning the ensemble classifier


In this section, we are going to compute the ROC curves from the test set to check if the
MajorityVoteClassifier generalizes well to unseen data. We should remember that the test set is
not to be used for model selection; its only purpose is to report an unbiased estimate of the
generalization performance of a classifier system. The code is as follows:


>>> import matplotlib.pyplot as plt
>>> from sklearn.metrics import roc_curve
>>> from sklearn.metrics import auc
>>> colors = ['black', 'orange', 'blue', 'green']
>>> linestyles = [':', '--', '-.', '-']
>>> for clf, label, clr, ls \
...         in zip(all_clf, clf_labels, colors, linestyles):
...     # assuming the label of the positive class is 1
...     y_pred = clf.fit(X_train,
...                      y_train).predict_proba(X_test)[:, 1]
...     fpr, tpr, thresholds = roc_curve(y_true=y_test,
...                                      y_score=y_pred)
...     roc_auc = auc(x=fpr, y=tpr)
...     plt.plot(fpr, tpr,
...              color=clr,
...              linestyle=ls,
...              label='%s (auc = %0.2f)' % (label, roc_auc))
>>> plt.legend(loc='lower right')
>>> plt.plot([0, 1], [0, 1],
...          linestyle='--',
...          color='gray',
...          linewidth=2)
>>> plt.xlim([-0.1, 1.1])
>>> plt.ylim([-0.1, 1.1])
>>> plt.grid()
>>> plt.xlabel('False Positive Rate')
>>> plt.ylabel('True Positive Rate')
>>> plt.show()




As we can see in the resulting ROC, the ensemble classifier also performs well on the test set (ROC
AUC = 0.95), whereas the k-nearest neighbors classifier seems to be overfitting the training data
(training ROC AUC = 0.93, test ROC AUC = 0.86):


[Figure: ROC curves on the test set, True Positive Rate versus False Positive Rate. Legend:
Logistic Regression (auc = 0.92), Decision Tree (auc = 0.89), KNN (auc = 0.86),
Majority Voting (auc = 0.95)]
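

The area under the ROC curve can also be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The following sketch computes the ROC AUC through this pairwise interpretation. The helper name pairwise_roc_auc and the four scores are made up for the illustration; it is not how scikit-learn's roc_curve and auc functions work internally.

```python
def pairwise_roc_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities of the positive class.
demo_auc = pairwise_roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(demo_auc)
```

Three of the four positive-negative pairs are ordered correctly here, so the score is 0.75. A value of 0.5 corresponds to random ranking.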


Since we only selected two features for the classification examples, it would be interesting to see
what the decision region of the ensemble classifier actually looks like. Although it is not necessary to
standardize the training features prior to model fitting because our logistic regression and k-nearest
neighbors pipelines will automatically take care of this, we will standardize the training set so that
the decision regions of the decision tree will be on the same scale for visual purposes. The code is as
follows:

>>> sc = StandardScaler()
>>> X_train_std = sc.fit_transform(X_train)
>>> from itertools import product
>>> x_min = X_train_std[:, 0].min() - 1
>>> x_max = X_train_std[:, 0].max() + 1
>>> y_min = X_train_std[:, 1].min() - 1
>>> y_max = X_train_std[:, 1].max() + 1
>>> xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
...                      np.arange(y_min, y_max, 0.1))
>>> f, axarr = plt.subplots(nrows=2, ncols=2,
...                         sharex='col',
...                         sharey='row',
...                         figsize=(7, 5))
>>> for idx, clf, tt in zip(product([0, 1], [0, 1]),
...                         all_clf, clf_labels):
...     clf.fit(X_train_std, y_train)
...     # predict the class label for every point of the grid
...     Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
...     Z = Z.reshape(xx.shape)
...     axarr[idx[0], idx[1]].contourf(xx, yy, Z, alpha=0.3)
...     axarr[idx[0], idx[1]].scatter(X_train_std[y_train==0, 0],
...                                   X_train_std[y_train==0, 1],
...                                   c='blue',
...                                   marker='^',
...                                   s=50)
...     axarr[idx[0], idx[1]].scatter(X_train_std[y_train==1, 0],
...                                   X_train_std[y_train==1, 1],
...                                   c='red',
...                                   marker='o',
...                                   s=50)
...     axarr[idx[0], idx[1]].set_title(tt)
>>> plt.text(-3.5, -4.5,
...          s='Sepal width [standardized]',
...          ha='center', va='center', fontsize=12)
>>> plt.text(-10.5, 4.5,
...          s='Petal length [standardized]',
...          ha='center', va='center',
...          fontsize=12, rotation=90)
>>> plt.show()


Interestingly but also as expected, the decision regions of the ensemble classifier seem to be a hybrid
of the decision regions from the individual classifiers. At first glance, the majority vote decision
boundary looks a lot like the decision boundary of the k-nearest neighbor classifier. However, we can
see that it is orthogonal to the y axis for sepal width >= 1, just like the decision tree stump:


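[Figure: decision regions of the individual classifiers and the majority vote ensemble (panels
include Logistic Regression and Decision Tree); x axis: Sepal width [standardized], y axis: Petal
length [standardized]]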


Before you learn how to tune the individual classifier parameters for ensemble classification, let's
call the get_params method to get a basic idea of how we can access the individual parameters
inside a GridSearch object:


>>> mv_clf.get_params()
{'decisiontreeclassifier': DecisionTreeClassifier(class_weight=None,
     criterion='entropy', max_depth=1,
     max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
     min_samples_split=2, min_weight_fraction_leaf=0.0,
     random_state=0, splitter='best'),
 'decisiontreeclassifier__class_weight': None,
 'decisiontreeclassifier__criterion': 'entropy',
 [...]
 'decisiontreeclassifier__random_state': 0,
 'decisiontreeclassifier__splitter': 'best',
 'pipeline-1': Pipeline(steps=[('sc', StandardScaler(copy=True, with_mean=True,
     with_std=True)), ('clf', LogisticRegression(C=0.001, class_weight=None,
     dual=False, fit_intercept=True,
     intercept_scaling=1, max_iter=100, multi_class='ovr',
     penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
     verbose=0))]),
 'pipeline-1__clf': LogisticRegression(C=0.001, class_weight=None, dual=False,
     fit_intercept=True,
     intercept_scaling=1, max_iter=100, multi_class='ovr',
     penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
     verbose=0),
 'pipeline-1__clf__C': 0.001,
 'pipeline-1__clf__class_weight': None,
 'pipeline-1__clf__dual': False,
 [...]
 'pipeline-1__sc__with_std': True,
 'pipeline-2': Pipeline(steps=[('sc', StandardScaler(copy=True, with_mean=True,
     with_std=True)), ('clf', KNeighborsClassifier(algorithm='auto', leaf_size=30,
     metric='minkowski',
     metric_params=None, n_neighbors=1, p=2, weights='uniform'))]),
 'pipeline-2__clf': KNeighborsClassifier(algorithm='auto', leaf_size=30,
     metric='minkowski',
     metric_params=None, n_neighbors=1, p=2, weights='uniform'),
 'pipeline-2__clf__algorithm': 'auto',
 [...]
 'pipeline-2__sc__with_std': True}


Based on the values returned by the get_params method, we now know how to access the individual
classifier's attributes. Let's now tune the inverse regularization parameter C of the logistic regression
classifier and the decision tree depth via a grid search for demonstration purposes. The code is as
follows:


>>> # scikit-learn >= 0.18 provides GridSearchCV in sklearn.model_selection
>>> from sklearn.grid_search import GridSearchCV
>>> params = {'decisiontreeclassifier__max_depth': [1, 2],
...           'pipeline-1__clf__C': [0.001, 0.1, 100.0]}
>>> grid = GridSearchCV(estimator=mv_clf,
...                     param_grid=params,
...                     cv=10,
...                     scoring='roc_auc')
>>> grid.fit(X_train, y_train)


After the grid search has completed, we can print the different hyperparameter value combinations
and the average ROC AUC scores computed via 10-fold cross-validation. The code is as follows:


>>> for params, mean_score, scores in grid.grid_scores_:
...     print("%0.3f+/-%0.2f %r"
...             % (mean_score, scores.std() / 2, params))
0.967+/-0.05 {'pipeline-1__clf__C': 0.001, 'decisiontreeclassifier__max_depth': 1}
0.967+/-0.05 {'pipeline-1__clf__C': 0.1, 'decisiontreeclassifier__max_depth': 1}
1.000+/-0.00 {'pipeline-1__clf__C': 100.0, 'decisiontreeclassifier__max_depth': 1}
0.967+/-0.05 {'pipeline-1__clf__C': 0.001, 'decisiontreeclassifier__max_depth': 2}
0.967+/-0.05 {'pipeline-1__clf__C': 0.1, 'decisiontreeclassifier__max_depth': 2}
1.000+/-0.00 {'pipeline-1__clf__C': 100.0, 'decisiontreeclassifier__max_depth': 2}
>>> print('Best parameters: %s' % grid.best_params_)
Best parameters: {'pipeline-1__clf__C': 100.0,
'decisiontreeclassifier__max_depth': 1}
>>> print('Accuracy: %.2f' % grid.best_score_)
Accuracy: 1.00


As we can see, we get the best cross-validation results when we choose a lower regularization
strength (C = 100.0) whereas the tree depth does not seem to affect the performance at all, suggesting
that a decision stump is sufficient to separate the data. To remind ourselves that it is a bad practice to
use the test dataset more than once for model evaluation, we are not going to estimate the
generalization performance of the tuned hyperparameters in this section. We will move on swiftly to
an alternative approach for ensemble learning: bagging.
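

The bookkeeping behind such a grid search can be sketched in a few lines: every combination of the listed values becomes one candidate setting, and the double underscore in a key separates the name of an ensemble member from the parameter inside it. The helpers expand_grid and split_param_key are hypothetical names for this illustration and only mimic the idea, not the GridSearchCV internals.

```python
from itertools import product


def expand_grid(param_grid):
    """List every combination of the values in param_grid as a dict."""
    names = sorted(param_grid)
    return [dict(zip(names, values))
            for values in product(*(param_grid[name] for name in names))]


def split_param_key(key):
    """Split 'member__param' at the first double underscore."""
    name, _, sub_key = key.partition('__')
    return name, sub_key


param_grid = {'decisiontreeclassifier__max_depth': [1, 2],
              'pipeline-1__clf__C': [0.001, 0.1, 100.0]}
candidates = expand_grid(param_grid)
print(len(candidates))
print(split_param_key('pipeline-1__clf__C'))
```

With two depths and three values of C, six candidates are evaluated, each with its own 10-fold cross-validation.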


Note 


The majority vote approach we implemented in this section is sometimes also referred to as stacking. 
However, the stacking algorithm is more typically used in combination with a logistic regression 
model that predicts the final class label using the predictions of the individual classifiers in the 
ensemble as input, which has been described in more detail by David H. Wolpert in D. H. Wolpert.
Stacked generalization. Neural networks, 5(2):241-259, 1992.


Bagging: building an ensemble of classifiers
from bootstrap samples


Bagging is an ensemble learning technique that is closely related to the MajorityVoteClassifier
that we implemented in the previous section, as illustrated in the following diagram:











[Figure: bagging workflow. Bootstrap samples are drawn from the training set, a classification
model is fit on each bootstrap sample, and the predictions of the models are combined by voting
into the final prediction.]


However, instead of using the same training set to fit the individual classifiers in the ensemble, we
draw bootstrap samples (random samples with replacement) from the initial training set, which is
why bagging is also known as bootstrap aggregating. To provide a more concrete example of how
bootstrapping works, let's consider the example shown in the following figure. Here, we have seven
different training instances (denoted as indices 1-7) that are sampled randomly with replacement in
each round of bagging. Each bootstrap sample is then used to fit a classifier $C_j$, which is most
typically an unpruned decision tree:


[Figure: table of the sample indices 1-7 and the bootstrap samples drawn from them in bagging
round 1, bagging round 2, and so on]
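

Drawing a bootstrap sample amounts to sampling indices with replacement. The sketch below uses Python's random module with an arbitrary seed; the helper name bootstrap_indices is an assumption for the illustration. It also estimates, for a larger training set, which fraction of the distinct samples ends up in a single bootstrap sample.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices from 1..n_samples with replacement."""
    return [rng.randint(1, n_samples) for _ in range(n_samples)]


rng = random.Random(1)

# Two rounds of bagging on seven training instances (indices 1-7).
round_1 = bootstrap_indices(7, rng)
round_2 = bootstrap_indices(7, rng)
print(round_1, round_2)

# Fraction of distinct samples that appear in one large bootstrap sample.
n = 10000
unique_fraction = len(set(bootstrap_indices(n, rng))) / n
print(round(unique_fraction, 3))
```

Some indices typically occur several times in a round while others are missing. For large training sets, roughly 63 percent of the distinct samples appear in any one bootstrap sample, so each classifier sees a somewhat different dataset.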





Bagging is also related to the random forest classifier that we introduced in Chapter 3, A Tour of
Machine Learning Classifiers Using Scikit-learn. In fact, random forests are a special case of
bagging where we also use random feature subsets to fit the individual decision trees. Bagging was
first proposed by Leo Breiman in a technical report in 1994; he also showed that bagging can
improve the accuracy of unstable models and decrease the degree of overfitting. I highly recommend
you read about his research in L. Breiman. Bagging Predictors. Machine Learning, 24(2):123-140,
1996, which is freely available online, to learn more about bagging.


To see bagging in action, let's create a more complex classification problem using the Wine dataset
that we introduced in Chapter 4, Building Good Training Sets — Data Preprocessing. Here, we will 
only consider the Wine classes 2 and 3, and we select two features: Alcohol and Hue. 


>>> import pandas as pd
>>> df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None)
>>> df_wine.columns = ['Class label', 'Alcohol',
...                    'Malic acid', 'Ash',
...                    'Alcalinity of ash',
...                    'Magnesium', 'Total phenols',
...                    'Flavanoids', 'Nonflavanoid phenols',
...                    'Proanthocyanins',
...                    'Color intensity', 'Hue',
...                    'OD280/OD315 of diluted wines',
...                    'Proline']
>>> # drop class 1 so that only the Wine classes 2 and 3 remain
>>> df_wine = df_wine[df_wine['Class label'] != 1]
>>> y = df_wine['Class label'].values
>>> X = df_wine[['Alcohol', 'Hue']].values


Next we encode the class labels into binary format and split the dataset into 60 percent training and 
40 percent test set, respectively: 


>>> from sklearn.preprocessing import LabelEncoder
>>> # scikit-learn >= 0.18 provides train_test_split in sklearn.model_selection
>>> from sklearn.cross_validation import train_test_split
>>> le = LabelEncoder()
>>> y = le.fit_transform(y)
>>> X_train, X_test, y_train, y_test =\
...        train_test_split(X, y,
...                         test_size=0.40,
...                         random_state=1)


A BaggingClassifier algorithm is already implemented in scikit-learn, which we can import from 
the ensemble submodule. Here, we will use an unpruned decision tree as the base classifier and 
create an ensemble of 500 decision trees fitted on different bootstrap samples of the training dataset: 


>>> from sklearn.ensemble import BaggingClassifier
>>> tree = DecisionTreeClassifier(criterion='entropy',
...                               max_depth=None)
>>> bag = BaggingClassifier(base_estimator=tree,
...                         n_estimators=500,
...                         max_samples=1.0,
...                         max_features=1.0,
...                         bootstrap=True,
...                         bootstrap_features=False,
...                         n_jobs=1,
...                         random_state=1)


Next we will calculate the accuracy score of the prediction on the training and test dataset to compare 
the performance of the bagging classifier to the performance of a single unpruned decision tree: 


>>> from sklearn.metrics import accuracy_score
>>> tree = tree.fit(X_train, y_train)
>>> y_train_pred = tree.predict(X_train)
>>> y_test_pred = tree.predict(X_test)
>>> tree_train = accuracy_score(y_train, y_train_pred)
>>> tree_test = accuracy_score(y_test, y_test_pred)
>>> print('Decision tree train/test accuracies %.3f/%.3f'
...       % (tree_train, tree_test))
Decision tree train/test accuracies 1.000/0.854


Based on the accuracy values that we printed by executing the preceding code snippet, the unpruned 
decision tree predicts all class labels of the training samples correctly; however, the substantially 
lower test accuracy indicates high variance (overfitting) of the model: 


>>> bag = bag.fit(X_train, y_train)
>>> y_train_pred = bag.predict(X_train)
>>> y_test_pred = bag.predict(X_test)
>>> bag_train = accuracy_score(y_train, y_train_pred)
>>> bag_test = accuracy_score(y_test, y_test_pred)
>>> print('Bagging train/test accuracies %.3f/%.3f'
...       % (bag_train, bag_test))
Bagging train/test accuracies 1.000/0.896


Although the decision tree and the bagging classifier reach the same accuracy on the training
set (both 1.0), we can see that the bagging classifier has a slightly better generalization performance
as estimated on the test set. Next let's compare the decision regions between the decision tree and
bagging classifier:


>>> # column 0 (Alcohol) is plotted on the x axis, column 1 (Hue) on the y axis
>>> x_min = X_train[:, 0].min() - 1
>>> x_max = X_train[:, 0].max() + 1
>>> y_min = X_train[:, 1].min() - 1
>>> y_max = X_train[:, 1].max() + 1
>>> xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
...                      np.arange(y_min, y_max, 0.1))
>>> f, axarr = plt.subplots(nrows=1, ncols=2,
...                         sharex='col',
...                         sharey='row',
...                         figsize=(8, 3))
>>> for idx, clf, tt in zip([0, 1],
...                         [tree, bag],
...                         ['Decision Tree', 'Bagging']):
...     clf.fit(X_train, y_train)
...
...     Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
...     Z = Z.reshape(xx.shape)
...     axarr[idx].contourf(xx, yy, Z, alpha=0.3)
...     axarr[idx].scatter(X_train[y_train==0, 0],
...                        X_train[y_train==0, 1],
...                        c='blue', marker='^')
...     axarr[idx].scatter(X_train[y_train==1, 0],
...                        X_train[y_train==1, 1],
...                        c='red', marker='o')
...     axarr[idx].set_title(tt)
>>> axarr[0].set_ylabel('Hue', fontsize=12)
>>> plt.text(10.2, -1.2,
...          s='Alcohol',
...          ha='center', va='center', fontsize=12)
>>> plt.show()


As we can see in the resulting plot, the piece-wise linear decision boundary of the three-node deep
decision tree looks smoother in the bagging ensemble:


[Figure: decision regions of the Decision Tree (left panel) and the Bagging classifier (right panel)]








We only looked at a very simple bagging example in this section. In practice, more complex
classification tasks and high-dimensional datasets can easily lead to overfitting in single decision
trees, and this is where the bagging algorithm can really play to its strengths. Finally, we shall note
that the bagging algorithm can be an effective approach to reduce the variance of a model. However,
bagging is ineffective in reducing model bias, which is why we want to choose an ensemble of
classifiers with low bias, for example, unpruned decision trees.
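

Why averaging helps with variance but not with bias can be illustrated with a small simulation. In the sketch below, each "model" is reduced to a single noisy estimate drawn independently around the same expected value; this independence is an idealizing assumption, since bagged classifiers are trained on overlapping bootstrap samples and are therefore correlated, so the reduction seen in practice is smaller. The helper names are made up for the illustration.

```python
import random


def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def variance_of_average(n_models, n_trials, rng):
    """Variance of the average of n_models independent unit-variance estimates."""
    averages = []
    for _ in range(n_trials):
        draws = [rng.gauss(0.0, 1.0) for _ in range(n_models)]
        averages.append(sum(draws) / n_models)
    return variance(averages)


rng = random.Random(0)
single = variance_of_average(n_models=1, n_trials=4000, rng=rng)
averaged = variance_of_average(n_models=25, n_trials=4000, rng=rng)
print(round(single, 3), round(averaged, 3))
```

The variance of the average shrinks roughly in proportion to the number of estimates, while the expected value, and hence any systematic bias shared by the estimates, is left unchanged.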
Leveraging weak learners via adaptive
boosting


In this section about ensemble methods, we will discuss boosting with a special focus on its most 
common implementation, AdaBoost (short for Adaptive Boosting). 


Note 


The original idea behind AdaBoost was formulated by Robert Schapire in 1990 (R. E. Schapire. The
Strength of Weak Learnability. Machine learning, 5(2):197-227, 1990). After Robert Schapire and
Yoav Freund presented the AdaBoost algorithm in the Proceedings of the Thirteenth International
Conference (ICML 1996), AdaBoost became one of the most widely used ensemble methods in the
years that followed (Y. Freund, R. E. Schapire, et al. Experiments with a New Boosting Algorithm. In
ICML, volume 96, pages 148-156, 1996). In 2003, Freund and Schapire received the Gödel Prize
for their groundbreaking work, which is a prestigious prize for the most outstanding publications in
the computer science field.


In boosting, the ensemble consists of very simple base classifiers, also often referred to as weak
learners, that have only a slight performance advantage over random guessing. A typical example of a
weak learner would be a decision tree stump. The key concept behind boosting is to focus on training
samples that are hard to classify, that is, to let the weak learners subsequently learn from
misclassified training samples to improve the performance of the ensemble. In contrast to bagging, the
initial formulation of the boosting algorithm uses random subsets of training samples drawn from the
training dataset without replacement. The original boosting procedure is summarized in four key steps
as follows:


1. Draw a random subset of training samples $d_1$ without replacement from the training set $D$ to
   train a weak learner $C_1$.
2. Draw a second random training subset $d_2$ without replacement from the training set and add 50
   percent of the samples that were previously misclassified to train a weak learner $C_2$.
3. Find the training samples $d_3$ in the training set $D$ on which $C_1$ and $C_2$ disagree to train
   a third weak learner $C_3$.
4. Combine the weak learners $C_1$, $C_2$, and $C_3$ via majority voting.


As discussed by Leo Breiman (L. Breiman. Bias, Variance, and Arcing Classifiers. 1996), boosting
can lead to a decrease in bias as well as variance compared to bagging models. In practice, however,
boosting algorithms such as AdaBoost are also known for their high variance, that is, the tendency to
overfit the training data (G. Raetsch, T. Onoda, and K. R. Mueller. An Improvement of Adaboost to
Avoid Overfitting. In Proc. of the Int. Conf. on Neural Information Processing. Citeseer, 1998).


In contrast to the original boosting procedure as described here, AdaBoost uses the complete training
set to train the weak learners where the training samples are reweighted in each iteration to build a
strong classifier that learns from the mistakes of the previous weak learners in the ensemble. Before
we dive deeper into the specific details of the AdaBoost algorithm, let's take a look at the following
figure to get a better grasp of the basic concept behind AdaBoost:





To walk through the AdaBoost illustration step by step, we start with subfigure 1, which represents a
training set for binary classification where all training samples are assigned equal weights. Based on
this training set, we train a decision stump (shown as a dashed line) that tries to classify the samples
of the two classes (triangles and circles) as well as possible by minimizing the cost function (or the
impurity score in the special case of decision tree ensembles). For the next round (subfigure 2), we
assign a larger weight to the two previously misclassified samples (circles). Furthermore, we lower
the weight of the correctly classified samples. The next decision stump will now be more focused on
the training samples that have the largest weights, that is, the training samples that are supposedly
hard to classify. The weak learner shown in subfigure 2 misclassifies three different samples from the
circle-class, which are then assigned a larger weight as shown in subfigure 3. Assuming that our
AdaBoost ensemble only consists of three rounds of boosting, we would then combine the three weak
learners trained on different reweighted training subsets by a weighted majority vote, as shown in
subfigure 4.


Now that we have a better understanding of the basic concept behind AdaBoost, let's take a more
detailed look at the algorithm using pseudo code. For clarity, we will denote element-wise
multiplication by the cross symbol ($\times$) and the dot product between two vectors by a dot
symbol ($\cdot$), respectively. The steps are as follows:


1. Set weight vector $w$ to uniform weights where $\sum_i w_i = 1$.
2. For $j$ in $m$ boosting rounds, do the following (steps 3 to 8):
3. Train a weighted weak learner: $C_j = \text{train}(X, y, w)$.
4. Predict class labels: $\hat{y} = \text{predict}(C_j, X)$.
5. Compute weighted error rate: $\varepsilon = w \cdot (\hat{y} \neq y)$.
6. Compute coefficient: $\alpha_j = 0.5 \log \frac{1 - \varepsilon}{\varepsilon}$.
7. Update weights: $w := w \times \exp(-\alpha_j \times \hat{y} \times y)$.
8. Normalize weights to sum to 1: $w := w / \sum_i w_i$.
9. Compute final prediction: $\hat{y} = \left( \sum_{j=1}^{m} \left( \alpha_j \times \text{predict}(C_j, X) \right) > 0 \right)$.


Note that the expression $(\hat{y} \neq y)$ in step 5 refers to a vector of 1s and 0s, where a 1 is
assigned if the prediction is incorrect and 0 is assigned otherwise, so that $\varepsilon$ is the total
weight of the misclassified samples.
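

The steps above can be sketched in plain Python for a one-dimensional dataset, using threshold stumps as weak learners. This is a minimal illustration under stated assumptions, not scikit-learn's AdaBoostClassifier: the function names, the exhaustive stump search, the clipping of the error rate, and the ten-sample toy dataset (class labels in {1, -1}) are all made up for the example.

```python
import math


def train_stump(X, y, w):
    """Return (threshold, polarity, error) of the best weighted 1-D stump."""
    values = sorted(set(X))
    thresholds = ([values[0] - 1.0]
                  + [(a + b) / 2.0 for a, b in zip(values, values[1:])]
                  + [values[-1] + 1.0])
    best = None
    for thr in thresholds:
        for polarity in (1, -1):
            pred = [polarity if x <= thr else -polarity for x in X]
            err = sum(wi for wi, p, t in zip(w, pred, y) if p != t)
            if best is None or err < best[2] - 1e-12:
                best = (thr, polarity, err)
    return best


def stump_predict(stump, X):
    thr, polarity, _ = stump
    return [polarity if x <= thr else -polarity for x in X]


def adaboost_fit(X, y, n_rounds):
    n = len(X)
    w = [1.0 / n] * n                                   # step 1
    ensemble = []
    for _ in range(n_rounds):                           # step 2
        stump = train_stump(X, y, w)                    # step 3
        y_hat = stump_predict(stump, X)                 # step 4
        eps = sum(wi for wi, p, t in zip(w, y_hat, y) if p != t)  # step 5
        eps = min(max(eps, 1e-10), 1.0 - 1e-10)         # guard against log(0)
        alpha = 0.5 * math.log((1.0 - eps) / eps)       # step 6
        w = [wi * math.exp(-alpha * p * t)              # step 7
             for wi, p, t in zip(w, y_hat, y)]
        total = sum(w)
        w = [wi / total for wi in w]                    # step 8
        ensemble.append((alpha, stump))
    return ensemble


def adaboost_predict(ensemble, X):
    scores = [0.0] * len(X)                             # step 9
    for alpha, stump in ensemble:
        for i, p in enumerate(stump_predict(stump, X)):
            scores[i] += alpha * p
    return [1 if s > 0 else -1 for s in scores]


# Hypothetical toy data: no single threshold separates the two classes.
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost_fit(X, y, n_rounds=3)
print([(round(a, 3), s[0], s[1]) for a, s in ensemble])
print(adaboost_predict(ensemble, X) == y)
```

On this toy data the best single stump misclassifies three of the ten samples, whereas the weighted vote of three stumps classifies all ten training samples correctly.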


Although the AdaBoost algorithm seems to be pretty straightforward, let's walk through a more 
concrete example using a training set consisting of 10 training samples as illustrated 1n the following 
table: 


[Table: one row per training sample (1 to 10) with the columns Sample indices, x, y, Weights,
$\hat{y}$ (x <= 3.0)?, Correct?, and Updated weights. Every initial weight is 0.1; the seven correctly
classified samples receive the updated weight 0.072 and the three misclassified samples receive
0.167.]


The first column of the table depicts the sample indices of the training samples 1 to 10. In the second
column, we see the feature values of the individual samples assuming this is a one-dimensional
dataset. The third column shows the true class label $y_i$ for each training sample $x_i$ where
$y_i \in \{1, -1\}$. The initial weights are shown in the fourth column; we initialize the weights to
uniform and normalize them to sum to one. In the case of the 10 sample training set, we therefore
assign 0.1 to each weight $w_i$ in the weight vector $w$. The predicted class labels $\hat{y}$ are
shown in the fifth column, assuming that our splitting criterion is $x \le 3.0$. The last column of the
table then shows the updated weights based on the update rules that we defined in the pseudocode.

Since the computation of the weight updates may look a little bit complicated at first, we will now
follow the calculation step by step. We start by computing the weighted error rate $\varepsilon$ as
described in step 5, where the seven correctly classified samples contribute 0 and the three
misclassified samples contribute their weight:


$\varepsilon = 7 \times (0.1 \times 0) + 3 \times (0.1 \times 1) = \frac{3}{10} = 0.3$


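Next we compute the coefficient $\alpha_j$ (shown in step 6), which is later used in step 7 to update
the weights as well as for the weights in majority vote prediction (step 9):


$\alpha_j = 0.5 \log \left( \frac{1 - \varepsilon}{\varepsilon} \right) = 0.5 \log \left( \frac{0.7}{0.3} \right) \approx 0.424$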


After we have computed the coefficient $\alpha_j$ we can now update the weight vector using the
following equation:


$w := w \times \exp(-\alpha_j \times \hat{y} \times y)$


Here, $\hat{y} \times y$ is an element-wise multiplication between the vectors of the predicted and
true class labels, respectively. Thus, if a prediction $\hat{y}_i$ is correct, $\hat{y}_i \times y_i$ will
have a positive sign so that we decrease the ith weight since $\alpha_j$ is a positive number as well:


$0.1 \times \exp(-0.424 \times 1 \times 1) \approx 0.066$


Similarly, we will increase the ith weight if $\hat{y}_i$ predicted the label incorrectly like this:


$0.1 \times \exp(-0.424 \times 1 \times (-1)) \approx 0.153$


Or like this:

$0.1 \times \exp(-0.424 \times (-1) \times 1) \approx 0.153$


After we update each weight in the weight vector, we normalize the weights so that they sum up to 1
(step 8):


$w := \frac{w}{\sum_i w_i}$


Here, $\sum_i w_i = 7 \times 0.065 + 3 \times 0.153 = 0.914$.


Thus, each weight that corresponds to a correctly classified sample will be reduced from the initial
value of 0.1 to $0.066 / 0.914 \approx 0.072$ for the next round of boosting. Similarly, the weights of
each incorrectly classified sample will increase from 0.1 to $0.153 / 0.914 \approx 0.167$.
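

The arithmetic of this first boosting round can be checked with a few lines of Python. The sketch only assumes what the text states: ten samples with uniform weights of 0.1, of which seven are classified correctly and three incorrectly; the variable names are made up. Because it works with unrounded intermediate values, its results agree with the figures in the text up to rounding in the third decimal place.

```python
import math

n_correct, n_wrong = 7, 3
w_init = 0.1

# Step 5: weighted error rate (total weight of the misclassified samples).
eps = n_wrong * w_init

# Step 6: coefficient of the weak learner.
alpha = 0.5 * math.log((1.0 - eps) / eps)

# Step 7: y_hat * y is +1 for correct and -1 for incorrect predictions.
w_correct = w_init * math.exp(-alpha * 1)
w_wrong = w_init * math.exp(-alpha * (-1))

# Step 8: normalize so that all ten weights sum to 1.
total = n_correct * w_correct + n_wrong * w_wrong
w_correct_norm = w_correct / total
w_wrong_norm = w_wrong / total

print(round(eps, 3), round(alpha, 3))
print(round(w_correct, 4), round(w_wrong, 4), round(total, 4))
print(round(w_correct_norm, 4), round(w_wrong_norm, 4))
```

After normalization the three misclassified samples together carry half of the total weight, which is what forces the next weak learner to concentrate on them.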


This was AdaBoost in a nutshell. Skipping to the more practical part, let's now train an AdaBoost
ensemble classifier via scikit-learn. We will use the same Wine subset that we used in the previous
section to train the bagging meta-classifier. Via the base_estimator attribute, we will train the
AdaBoostClassifier on 500 decision tree stumps:


>>> from sklearn.ensemble import AdaBoostClassifier
>>> tree = DecisionTreeClassifier(criterion='entropy',
...                               max_depth=1)
>>> ada = AdaBoostClassifier(base_estimator=tree,
...                          n_estimators=500,
...                          learning_rate=0.1,
...                          random_state=0)
>>> tree = tree.fit(X_train, y_train)
>>> y_train_pred = tree.predict(X_train)
>>> y_test_pred = tree.predict(X_test)
>>> tree_train = accuracy_score(y_train, y_train_pred)
>>> tree_test = accuracy_score(y_test, y_test_pred)
>>> print('Decision tree train/test accuracies %.3f/%.3f'
...       % (tree_train, tree_test))
Decision tree train/test accuracies 0.845/0.854


As we can see, the decision tree stump seems to underfit the training data in contrast with the unpruned
decision tree that we saw in the previous section:


Po? Oe = acavlill(xX train, Y train) 

Pee Vy Crain Dree = acda.prediCce(x~ Erain) 

27 e VY Vest. Pred = a@da«predice(x% test) 

poe Ga hain. = 2cCuracy SCoOrety Train, 7. train pred) 
yor Goa, VeSU. = accuracy SCOre(y test, y Lest.pred) 


(e) 


>>> print('AdaBoost train/test accuracies %$.3f/%.3f' 


o (ada Train, ada. Test)’) 
AdaBoost train/vtese accuracies 1.000/70.875 


As we can see, the AdaBoost model predicts all class labels of the training set correctly and also 
shows a slightly improved test set performance compared to the decision tree stump. However, we 
also see that we introduced additional variance by our attempt to reduce the model bias. 


Although we used another simple example for demonstration purposes, we can see that the 
performance of the AdaBoost classifier is slightly improved compared to the decision stump and 
achieved very similar accuracy scores to the bagging classifier that we trained in the previous 
section. However, we should note that it is considered bad practice to select a model based on the
repeated usage of the test set. The estimate of the generalization performance may be too optimistic,
which we discussed in more detail in Chapter 6, Learning Best Practices for Model Evaluation and 
Hyperparameter Tuning. 


Finally, let's check what the decision regions look like:


>>> x_min = X_train[:, 0].min() - 1
>>> x_max = X_train[:, 0].max() + 1
>>> y_min = X_train[:, 1].min() - 1
>>> y_max = X_train[:, 1].max() + 1
>>> xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
...                      np.arange(y_min, y_max, 0.1))
>>> f, axarr = plt.subplots(1, 2,
...                         sharex='col',
...                         sharey='row',
...                         figsize=(8, 3))
>>> for idx, clf, tt in zip([0, 1],
...                         [tree, ada],
...                         ['Decision Tree', 'AdaBoost']):
...     clf.fit(X_train, y_train)
...     Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
...     Z = Z.reshape(xx.shape)
...     axarr[idx].contourf(xx, yy, Z, alpha=0.3)
...     axarr[idx].scatter(X_train[y_train==0, 0],
...                        X_train[y_train==0, 1],
...                        c='blue',
...                        marker='^')
...     axarr[idx].scatter(X_train[y_train==1, 0],
...                        X_train[y_train==1, 1],
...                        c='red',
...                        marker='o')
...     axarr[idx].set_title(tt)
...     axarr[0].set_ylabel('Alcohol', fontsize=12)
>>> plt.text(10.2, -1.2,
...          s='Hue',
...          ha='center',
...          va='center',
...          fontsize=12)
>>> plt.show()


By looking at the decision regions, we can see that the decision boundary of the AdaBoost model is
substantially more complex than the decision boundary of the decision stump. In addition, we note that
the AdaBoost model separates the feature space very similarly to the bagging classifier that we
trained in the previous section.


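[Figure: decision regions of the decision tree stump (left panel, "Decision Tree") and of the AdaBoost ensemble (right panel, "AdaBoost"); x-axis: Hue, y-axis: Alcohol]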





As concluding remarks about ensemble techniques, it is worth noting that ensemble learning increases
the computational complexity compared to individual classifiers. In practice, we need to think
carefully about whether we want to pay the price of increased computational costs for an often
relatively modest improvement in predictive performance.


An often-cited example of this trade-off is the famous $1 Million Netflix Prize, which was won using
ensemble techniques. The details about the algorithm were published in A. Toescher, M. Jahrer, and
R. M. Bell. The BigChaos Solution to the Netflix Grand Prize. Netflix prize documentation, 2009
(which is available at http://www.stat.osu.edu/~dmsl/GrandPrize2009_BPC_BigChaos.pdf).





Although the winning team received the $1 million prize money, Netflix never implemented their
model due to its complexity, which made it infeasible for a real-world application. To quote their
exact words (http:


"[...] additional accuracy gains that we measured did not seem to justify the engineering
effort needed to bring them into a production environment."


Summary


In this chapter, we looked at some of the most popular and widely used techniques for ensemble
learning. Ensemble methods combine different classification models to cancel out their individual
weaknesses, which often results in stable and well-performing models that are very attractive for
industrial applications as well as machine learning competitions.


In the beginning of this chapter, we implemented a MajorityVoteClassifier in Python that allows
us to combine different algorithms for classification. We then looked at bagging, a useful technique to
reduce the variance of a model by drawing random bootstrap samples from the training set and
combining the individually trained classifiers via majority vote. Then we discussed AdaBoost, which
is an algorithm that is based on weak learners that subsequently learn from mistakes.


Throughout the previous chapters, we discussed different learning algorithms, tuning, and evaluation
techniques. In the following chapter, we will look at a particular application of machine learning,
sentiment analysis, which has certainly become an interesting topic in the era of the Internet and
social media.


Chapter 8. Applying Machine Learning to Sentiment Analysis


In this age of the Internet and social media, people's opinions, reviews, and recommendations have
become a valuable resource for political science and businesses. Thanks to modern technologies, we
are now able to collect and analyze such data most efficiently. In this chapter, we will delve into a
subfield of natural language processing (NLP) called sentiment analysis and learn how to use
machine learning algorithms to classify documents based on their polarity: the attitude of the writer.
The topics that we will cover in the following sections include:


• Cleaning and preparing text data
• Building feature vectors from text documents
• Training a machine learning model to classify positive and negative movie reviews
• Working with large text datasets using out-of-core learning


Obtaining the IMDb movie review dataset


Sentiment analysis, sometimes also called opinion mining, is a popular sub-discipline of the broader 
field of NLP; it analyzes the polarity of documents. A popular task in sentiment analysis is the 
classification of documents based on the expressed opinions or emotions of the authors with regard to 
a particular topic. 


In this chapter, we will be working with a large dataset of movie reviews from the Internet Movie 
Database (IMDb) that has been collected by Maas et al. (A. L. Maas, R. E. Daly, P. T. Pham, D. 
Huang, A. Y. Ng, and C. Potts. Learning Word Vectors for Sentiment Analysis. In the proceedings of 
the 49th Annual Meeting of the Association for Computational Linguistics: Human Language 
Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational
Linguistics). The movie review dataset consists of 50,000 polar movie reviews that are labeled as 
either positive or negative; here, positive means that a movie was rated with more than six stars on 
IMDb, and negative means that a movie was rated with fewer than five stars on IMDb. In the 
following sections, we will learn how to extract meaningful information from a subset of these movie 
reviews to build a machine learning model that can predict whether a certain reviewer liked or 
disliked a movie. 


A compressed archive of the movie review dataset (84.1 MB) can be downloaded from
http://ai.stanford.edu/~amaas/data/sentiment/ as a gzip-compressed tarball:


• If you are working with Linux or Mac OS X, you can open a new terminal window, use cd to go
into the download directory, and execute tar -zxf aclImdb_v1.tar.gz to decompress the
dataset

• If you are working with Windows, you can download a free archiver such as 7-Zip
(http://www.7-zip.org) to extract the files from the download archive


Having successfully extracted the dataset, we will now assemble the individual text documents from
the decompressed download archive into a single CSV file. In the following code section, we will be
reading the movie reviews into a pandas DataFrame object, which can take up to 10 minutes on a
standard desktop computer. To visualize the progress and estimated time until completion, we will
use the PyPrind (Python Progress Indicator, https://pypi.python.org/pypi/PyPrind/) package that I
developed several years ago for such purposes. PyPrind can be installed by executing the command:
pip install pyprind.


>>> import pyprind
>>> import pandas as pd
>>> import os
>>> pbar = pyprind.ProgBar(50000)
>>> labels = {'pos':1, 'neg':0}
>>> df = pd.DataFrame()
>>> for s in ('test', 'train'):
...     for l in ('pos', 'neg'):
...         path = './aclImdb/%s/%s' % (s, l)
...         for file in os.listdir(path):
...             with open(os.path.join(path, file), 'r') as infile:
...                 txt = infile.read()
...             # DataFrame.append requires pandas < 2.0
...             df = df.append([[txt, labels[l]]], ignore_index=True)
...             pbar.update()
>>> df.columns = ['review', 'sentiment']
0%                          100%
[##############################] | ETA[sec]: 0.000
Total time elapsed: 725.001 sec


Executing the preceding code, we first initialized a new progress bar object pbar with 50,000
iterations, which is the number of documents we were going to read in. Using the nested for loops,
we iterated over the train and test subdirectories in the main aclImdb directory and read the
individual text files from the pos and neg subdirectories that we eventually appended to the
DataFrame df, together with an integer class label (1 = positive and 0 = negative).


Since the class labels in the assembled dataset are sorted, we will now shuffle the DataFrame using the
permutation function from the np.random submodule. This will be useful for splitting the dataset into
training and test sets in later sections, when we will stream the data from our local drive directly. For
our own convenience, we will also store the assembled and shuffled movie review dataset as a CSV
file:


>>> import numpy as np
>>> np.random.seed(0)
>>> df = df.reindex(np.random.permutation(df.index))
>>> df.to_csv('./movie_data.csv', index=False)


Since we are going to use this dataset later in this chapter, let us quickly confirm that we successfully
saved the data in the right format by reading in the CSV and printing an excerpt of the first three
samples:


>>> df = pd.read_csv('./movie_data.csv')
>>> df.head(3)


If you are running the code examples in IPython Notebook, you should now see the first three samples
of the dataset, as shown in the following table:


Introducing the bag-of-words model


We remember from Chapter 4, Building Good Training Sets – Data Preprocessing, that we have to
convert categorical data, such as text or words, into a numerical form before we can pass it on to a
machine learning algorithm. In this section, we will introduce the bag-of-words model that allows us
to represent text as numerical feature vectors. The idea behind the bag-of-words model is quite
simple and can be summarized as follows:


1. We create a vocabulary of unique tokens (for example, words) from the entire set of
documents.

2. We construct a feature vector from each document that contains the counts of how often each
word occurs in the particular document.


Since the unique words in each document represent only a small subset of all the words in the bag-of- 
words vocabulary, the feature vectors will consist of mostly zeros, which is why we call them 
sparse. Do not worry if this sounds too abstract; in the following subsections, we will walk through 
the process of creating a simple bag-of-words model step-by-step. 
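
The two steps can be sketched in a few lines of plain Python before turning to scikit-learn. This is a minimal illustration, assuming three short example sentences, lowercase conversion, and whitespace tokenization; a library vectorizer may tokenize differently. The names tokenized, unique_tokens, vocabulary, and count_vector are chosen for this sketch only.

```python
from collections import Counter

docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining and the weather is sweet']

# Simplifying assumption: lowercase the text and split at whitespace.
tokenized = [doc.lower().split() for doc in docs]

# Step 1: vocabulary of unique tokens, mapped to column indices in alphabetical order
unique_tokens = sorted({token for tokens in tokenized for token in tokens})
vocabulary = {token: index for index, token in enumerate(unique_tokens)}

# Step 2: one count vector per document (Counter returns 0 for absent tokens)
def count_vector(tokens):
    counts = Counter(tokens)
    return [counts[token] for token in unique_tokens]

bag = [count_vector(tokens) for tokens in tokenized]

for row in bag:
    print(row)
```

Even in this tiny corpus, the first two vectors contain zeros for the words that only appear in the third sentence; with a realistic vocabulary of tens of thousands of words, almost all entries would be zero.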




Transforming words into feature vectors 


To construct a bag-of-words model based on the word counts in the respective documents, we can use
the CountVectorizer class implemented in scikit-learn. As we will see in the following code
section, the CountVectorizer class takes an array of text data, which can be documents or just
sentences, and constructs the bag-of-words model for us:


>>> import numpy as np
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> count = CountVectorizer()
>>> docs = np.array([
...        'The sun is shining',
...        'The weather is sweet',
...        'The sun is shining and the weather is sweet'])
>>> bag = count.fit_transform(docs)


By calling the fit_transform method on CountVectorizer, we just constructed the vocabulary of
the bag-of-words model and transformed the following three sentences into sparse feature vectors:


1. The sun is shining
2. The weather is sweet
3. The sun is shining and the weather is sweet


Now let us print the contents of the vocabulary to get a better understanding of the underlying
concepts:


>>> print(count.vocabulary_)
{'the': 5, 'shining': 2, 'weather': 6, 'sun': 3, 'is': 1, 'sweet': 4, 'and': 0}


As we can see from executing the preceding command, the vocabulary is stored in a Python
dictionary, which maps the unique words to integer indices. Next let us print the
feature vectors that we just created:


>>> print(bag.toarray())
[[0 1 1 1 0 1 0]
 [0 1 0 0 1 1 1]
 [1 2 1 1 1 2 1]]


Each index position in the feature vectors shown here corresponds to the integer values that are stored
as dictionary items in the CountVectorizer vocabulary. For example, the first feature at index
position 0 represents the count of the word and, which only occurs in the last document, and the word
is at index position 1 (the 2nd feature in the document vectors) occurs in all three sentences. Those
values in the feature vectors are also called the raw term frequencies: tf(t, d), the number of times
a term t occurs in a document d.


Note 


The sequence of items in the bag-of-words model that we just created is also called the 1-gram or
unigram model: each item or token in the vocabulary represents a single word. More generally,
contiguous sequences of items in NLP (words, letters, or symbols) are also called n-grams. The
choice of the number n in the n-gram model depends on the particular application; for example, a
study by Kanaris et al. revealed that n-grams of size 3 and 4 yield good performances in anti-spam
filtering of e-mail messages (Ioannis Kanaris, Konstantinos Kanaris, Ioannis Houvardas, and
Efstathios Stamatatos. Words vs Character N-Grams for Anti-Spam Filtering. International Journal
on Artificial Intelligence Tools, 16(06):1047–1067, 2007). To summarize the concept of the n-gram
representation, the 1-gram and 2-gram representations of our first document "the sun is shining"
would be constructed as follows:


• 1-gram: "the", "sun", "is", "shining"

• 2-gram: "the sun", "sun is", "is shining"
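
The construction of word n-grams can be sketched as a sliding window over the token list. The helper name ngrams is hypothetical and the whitespace tokenization is a simplifying assumption; this is not how any particular library implements it.

```python
def ngrams(text, n):
    """Return the contiguous word n-grams of a whitespace-tokenized text."""
    tokens = text.lower().split()
    # Slide a window of n tokens across the document
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

unigrams = ngrams('the sun is shining', 1)
bigrams = ngrams('the sun is shining', 2)

print(unigrams)
print(bigrams)
```

A document with k tokens yields k - n + 1 n-grams, so larger values of n produce fewer but more specific features.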
The CountVectorizer class in scikit-learn allows us to use different n-gram models via its
ngram_range parameter. While a 1-gram representation is used by default, we could switch to a
2-gram representation by initializing a new CountVectorizer instance with ngram_range=(2,2).


Assessing word relevancy via term frequency-inverse document frequency


When we are analyzing text data, we often encounter words that occur across multiple documents
from both classes. Those frequently occurring words typically don't contain useful or discriminatory
information. In this subsection, we will learn about a useful technique called term frequency-inverse
document frequency (tf-idf) that can be used to downweight those frequently occurring
words in the feature vectors. The tf-idf can be defined as the product of the term frequency and the
inverse document frequency:


$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t, d)$


Here the tf(t, d) is the term frequency that we introduced in the previous section, and the inverse
document frequency idf(t, d) can be calculated as:


$\text{idf}(t, d) = \log \frac{n_d}{1 + \text{df}(d, t)}$


where $n_d$ is the total number of documents, and df(d, t) is the number of documents d that contain the
term t. Note that adding the constant 1 to the denominator is optional and serves the purpose of
assigning a non-zero value to terms that occur in all training samples; the log is used to ensure that
low document frequencies are not given too much weight.


Scikit-learn implements yet another transformer, the TfidfTransformer, that takes the raw term
frequencies from CountVectorizer as input and transforms them into tf-idfs:


>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> tfidf = TfidfTransformer()
>>> np.set_printoptions(precision=2)
>>> print(tfidf.fit_transform(count.fit_transform(docs)).toarray())
[[ 0.    0.43  0.56  0.56  0.    0.43  0.  ]
 [ 0.    0.43  0.    0.    0.56  0.43  0.56]
 [ 0.4   0.48  0.31  0.31  0.31  0.48  0.31]]


As we saw in the previous subsection, the word is had the largest term frequency in the 3rd
document, being the most frequently occurring word. However, after transforming the same feature
vector into tf-idfs, we see that the word is is now associated with a tf-idf of 0.48 in document 3,
which is only slightly larger than the 0.40 of the rarer word and. The word is has been downweighted
because it is also contained in documents 1 and 2 and thus is unlikely to contain any useful,
discriminatory information.


However, if we'd manually calculated the tf-idfs of the individual terms in our feature vectors, we'd
have noticed that the TfidfTransformer calculates the tf-idfs slightly differently compared to the
standard textbook equations that we defined earlier. The equations for the idf and tf-idf that were
implemented in scikit-learn are:


$\text{idf}(t, d) = \log \frac{1 + n_d}{1 + \text{df}(d, t)}$


The tf-idf equation that was implemented in scikit-learn is as follows:


$\text{tf-idf}(t, d) = \text{tf}(t, d) \times (\text{idf}(t, d) + 1)$


While it is also more typical to normalize the raw term frequencies before calculating the tf-idfs, the
TfidfTransformer normalizes the tf-idfs directly. By default (norm='l2'), scikit-learn's
TfidfTransformer applies the L2-normalization, which returns a vector of length 1 by dividing an
un-normalized feature vector v by its L2-norm:


$v_{norm} = \frac{v}{\lVert v \rVert_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}} = \frac{v}{\left(\sum_{i=1}^{n} v_i^2\right)^{1/2}}$











To make sure that we understand how TfidfTransformer works, let us walk through an example and 
calculate the tf-idf of the word is in the 3rd document. 


The word is has a term frequency of 2 (tf = 2) in document 3, and the document frequency of this term
is 3 since the term is occurs in all three documents (df = 3). Thus, we can calculate the idf as
follows:


$\text{idf}(\text{"is"}, d3) = \log \frac{1 + 3}{1 + 3} = 0$


Now in order to calculate the tf-idf, we simply need to add 1 to the inverse document frequency and
multiply it by the term frequency:


$\text{tf-idf}(\text{"is"}, d3) = 2 \times (0 + 1) = 2$


If we repeated these calculations for all terms in the 3rd document, we'd obtain the following tf-idf
vector: [1.69, 2.00, 1.29, 1.29, 1.29, 2.00, 1.29]. However, we notice that the values in this
feature vector are different from the values that we obtained from the TfidfTransformer that we
used previously. The final step that we are missing in this tf-idf calculation is the L2-normalization,
which can be applied as follows:


$\text{tf-idf}(d3)_{norm} = \frac{[1.69, 2.00, 1.29, 1.29, 1.29, 2.00, 1.29]}{\sqrt{1.69^2 + 2.00^2 + 1.29^2 + 1.29^2 + 1.29^2 + 2.00^2 + 1.29^2}}$


$= [0.40, 0.48, 0.31, 0.31, 0.31, 0.48, 0.31]$
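
The whole calculation for document 3 can be reproduced in plain Python. The following is a sketch under stated assumptions: it uses the three example sentences, whitespace tokenization, and the idf and tf-idf variants given above for scikit-learn; the helper names idf and tfidf_vector are illustrative and do not belong to any library.

```python
import math

docs = ['The sun is shining',
        'The weather is sweet',
        'The sun is shining and the weather is sweet']

tokenized = [doc.lower().split() for doc in docs]
vocab = sorted({token for tokens in tokenized for token in tokens})
n_docs = len(tokenized)

def idf(term):
    # Document frequency: number of documents that contain the term
    df = sum(term in tokens for tokens in tokenized)
    return math.log((1 + n_docs) / (1 + df))

def tfidf_vector(tokens):
    # Raw tf-idf: tf * (idf + 1), one entry per vocabulary term
    raw = [tokens.count(term) * (idf(term) + 1) for term in vocab]
    # L2-normalization: divide by the Euclidean length of the vector
    l2_norm = math.sqrt(sum(value ** 2 for value in raw))
    return raw, [value / l2_norm for value in raw]

raw, normalized = tfidf_vector(tokenized[2])

print([round(value, 2) for value in raw])
print([round(value, 2) for value in normalized])
```

The entry for is ends up at 0.48 after normalization, which is the second value of the third row printed by the TfidfTransformer.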


As we can see, the results now match the results returned by scikit-learn's TfidfTransformer. Since
we now understand how tf-idfs are calculated, let us proceed to the next sections and apply those
concepts to the movie review dataset.


Cleaning text data


In the previous subsections, we learned about the bag-of-words model, term frequencies, and tf-idfs.
However, the first important step, before we build our bag-of-words model, is to clean the text
data by stripping it of all unwanted characters. To illustrate why this is important, let us display the
last 50 characters from the first document in the reshuffled movie review dataset:


>>> df.loc[0, 'review'][-50:]
'is seven.<br /><br />Title (Brazil): Not Available'


As we can see here, the text contains HTML markup as well as punctuation and other non-letter
characters. While HTML markup does not contain much useful semantics, punctuation marks can
represent useful, additional information in certain NLP contexts. However, for simplicity, we will
now remove all punctuation marks except for emoticon characters such as ":)" since those are
certainly useful for sentiment analysis. To accomplish this task, we will use Python's regular
expression (regex) library, re, as shown here:


>>> import re
>>> def preprocessor(text):
...     text = re.sub(r'<[^>]*>', '', text)
...     emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
...     text = re.sub(r'[\W]+', ' ', text.lower()) + \
...            ' '.join(emoticons).replace('-', '')
...     return text


Via the first regex <[^>]*> in the preceding code section, we tried to remove all HTML
markup that was contained in the movie reviews. Although many programmers generally advise
against the use of regex to parse HTML, this regex should be sufficient to clean this particular
dataset. After we removed the HTML markup, we used a slightly more complex regex to find
emoticons, which we temporarily stored as emoticons. Next we removed all non-word characters
from the text via the regex [\W]+, converted the text into lowercase characters, and eventually added
the temporarily stored emoticons to the end of the processed document string. Additionally, we
removed the nose character (-) from the emoticons for consistency.
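
To make the individual regex steps easier to follow, the sketch below applies them one at a time to a short test string and keeps every intermediate value. The test string and the variable names (raw, no_html, words, cleaned) are assumptions made for illustration; the patterns are the ones described above.

```python
import re

raw = '</a>This :) is :( a test :-)!'

# 1. Strip HTML tags: '<', any run of characters other than '>', then '>'
no_html = re.sub(r'<[^>]*>', '', raw)

# 2. Collect emoticons: eyes (: ; =), optional nose (-), mouth ( ) ( D P )
emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', no_html)

# 3. Lowercase and collapse every run of non-word characters into one space
words = re.sub(r'[\W]+', ' ', no_html.lower())

# 4. Re-append the emoticons with the nose character removed
cleaned = words + ' '.join(emoticons).replace('-', '')

print(repr(no_html))
print(emoticons)
print(repr(words))
print(repr(cleaned))
```

Step 3 leaves a trailing space after the last word, which is why the emoticons can be appended in step 4 without an extra separator.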


Note 


Although regular expressions offer an efficient and convenient approach to searching for characters in
a string, they also come with a steep learning curve. Unfortunately, an in-depth discussion of regular
expressions is beyond the scope of this book. However, you can find a great tutorial on the Google
Developers portal at https://developers.google.com/edu/python/regular-expressions or check out the
official documentation of Python's re module at https://docs.python.org/3.4/library/re.html.


Although the addition of the emoticon characters to the end of the cleaned document strings may not
look like the most elegant approach, the order of the words doesn't matter in our bag-of-words model
if our vocabulary only consists of 1-word tokens. But before we talk more about splitting documents
into individual terms, words, or tokens, let us confirm that our preprocessor works correctly:


>>> preprocessor(df.loc[0, 'review'][-50:])
'is seven title brazil not available'
>>> preprocessor("</a>This :) is :( a test :-)!")
'this is a test :) :( :)'


Lastly, since we will make use of the cleaned text data over and over again during the next sections,
let us now apply our preprocessor function to all movie reviews in our DataFrame:


>>> df['review'] = df['review'].apply(preprocessor)


Processing documents into tokens


Having successfully prepared the movie review dataset, we now need to think about how to split the
text corpora into individual elements. One way to tokenize documents is to split them into individual
words by splitting the cleaned document at its whitespace characters:


>>> def tokenizer(text):
...     return text.split()
>>> tokenizer('runners like running and thus they run')
['runners', 'like', 'running', 'and', 'thus', 'they', 'run']


In the context of tokenization, another useful technique is word stemming, the process of
transforming a word into its root form, which allows us to map related words to the same stem. The
original stemming algorithm was developed by Martin F. Porter in 1979 and is hence known as the
Porter stemmer algorithm (Martin F. Porter. An algorithm for suffix stripping. Program: electronic
library and information systems, 14(3):130–137, 1980). The Natural Language Toolkit for Python
(NLTK, http://www.nltk.org) implements the Porter stemming algorithm, which we will use in the
following code section. In order to install the NLTK, you can simply execute pip install nltk.


>>> from nltk.stem.porter import PorterStemmer
>>> porter = PorterStemmer()
>>> def tokenizer_porter(text):
...     return [porter.stem(word) for word in text.split()]
>>> tokenizer_porter('runners like running and thus they run')
['runner', 'like', 'run', 'and', 'thu', 'they', 'run']


Note 


Although NLTK is not the focus of the chapter, I highly recommend that you visit the NLTK website as
well as the official NLTK book, which is freely available at http://www.nltk.org/book/, if you are
interested in more advanced applications in NLP.


Using PorterStemmer from the nltk package, we modified our tokenizer function to reduce words 
to their root form, which was illustrated by the previous simple example where the word running 
was stemmed to its root form run. 


Note 


The Porter stemming algorithm is probably the oldest and simplest stemming algorithm. Other popular
stemming algorithms include the newer Snowball stemmer (Porter2 or "English" stemmer) or the
Lancaster stemmer (Paice-Husk stemmer), which is faster but also more aggressive than the Porter
stemmer. Those alternative stemming algorithms are also available through the NLTK package
(http://www.nltk.org/api/nltk.stem.html).


While stemming can create non-real words, such as thu (from thus), as shown in the previous
example, a technique called lemmatization aims to obtain the canonical (grammatically correct)
forms of individual words, the so-called lemmas. However, lemmatization is computationally more
difficult and expensive compared to stemming and, in practice, it has been observed that stemming
and lemmatization have little impact on the performance of text classification (Michal Toman, Roman
Tesar, and Karel Jezek. Influence of word normalization on text classification. Proceedings of
InSciT, pages 354–358, 2006).


Before we jump into the next section where we will train a machine learning model using the
bag-of-words model, let us briefly talk about another useful topic called stop-word removal. Stop-words
are simply those words that are extremely common in all sorts of texts and likely bear no (or only
little) useful information that can be used to distinguish between different classes of documents.
Examples of stop-words are is, and, has, and the like. Removing stop-words can be useful if we are
working with raw or normalized term frequencies rather than tf-idfs, which are already
downweighting frequently occurring words.
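
The filtering itself is a simple membership test. The sketch below uses a tiny, hand-picked stop list that is purely hypothetical and far shorter than any real stop-word set; the function name remove_stop_words is likewise chosen only for this illustration.

```python
# Hypothetical, hand-picked stop list for illustration only
stop_words = {'a', 'and', 'has', 'is', 'the'}

def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]

tokens = 'the sun is shining and the weather is sweet'.split()
filtered = remove_stop_words(tokens, stop_words)

print(filtered)
```

Storing the stop-words in a set rather than a list keeps each membership test at constant time, which matters once the filter runs over tens of thousands of documents.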


In order to remove stop-words from the movie reviews, we will use the set of 127 English stop-words
that is available from the NLTK library, which can be obtained by calling the nltk.download
function:


>>> import nltk
>>> nltk.download('stopwords')


After we have downloaded the stop-words set, we can load and apply the English stop-word set as
follows:


>>> from nltk.corpus import stopwords
>>> stop = stopwords.words('english')
>>> [w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:]
...  if w not in stop]
['runner', 'like', 'run', 'run', 'lot']




Training a logistic regression model for document classification


In this section, we will train a logistic regression model to classify the movie reviews into positive
and negative reviews. First, we will divide the DataFrame of cleaned text documents into 25,000
documents for training and 25,000 documents for testing:


>>> # .loc slices by label and includes the end label, so :24999 selects 25,000 rows
>>> X_train = df.loc[:24999, 'review'].values
>>> y_train = df.loc[:24999, 'sentiment'].values
>>> X_test = df.loc[25000:, 'review'].values
>>> y_test = df.loc[25000:, 'sentiment'].values


Next we will use a GridSearchCV object to find the optimal set of parameters for our logistic
regression model using 5-fold stratified cross-validation:


>>> # sklearn.grid_search in scikit-learn versions before 0.18
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> tfidf = TfidfVectorizer(strip_accents=None,
...                         lowercase=False,
...                         preprocessor=None)
>>> param_grid = [{'vect__ngram_range': [(1,1)],
...                'vect__stop_words': [stop, None],
...                'vect__tokenizer': [tokenizer,
...                                    tokenizer_porter],
...                'clf__penalty': ['l1', 'l2'],
...                'clf__C': [1.0, 10.0, 100.0]},
...               {'vect__ngram_range': [(1,1)],
...                'vect__stop_words': [stop, None],
...                'vect__tokenizer': [tokenizer,
...                                    tokenizer_porter],
...                'vect__use_idf': [False],
...                'vect__norm': [None],
...                'clf__penalty': ['l1', 'l2'],
...                'clf__C': [1.0, 10.0, 100.0]}
...              ]
>>> # liblinear supports both the 'l1' and the 'l2' penalty
>>> lr_tfidf = Pipeline([('vect', tfidf),
...                      ('clf',
...                       LogisticRegression(solver='liblinear',
...                                          random_state=0))])
>>> gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
...                            scoring='accuracy',
...                            cv=5, verbose=1,
...                            n_jobs=-1)
>>> gs_lr_tfidf.fit(X_train, y_train)

When we initialized the GridSearchCV object and its parameter grid using the preceding code, we
restricted ourselves to a limited number of parameter combinations since the number of feature
vectors, as well as the large vocabulary, can make the grid search computationally quite expensive;
using a standard desktop computer, our grid search may take up to 40 minutes to complete.


In the previous code example, we replaced the CountVectorizer and TfidfTransformer from the
previous subsection with the TfidfVectorizer, which combines the latter transformer objects. Our
param_grid consisted of two parameter dictionaries. In the first dictionary, we used the
TfidfVectorizer with its default settings (use_idf=True, smooth_idf=True, and norm='l2') to
calculate the tf-idfs; in the second dictionary, we set use_idf=False and norm=None in order to
train a model based on raw term frequencies.
Furthermore, for the logistic regression classifier itself, we trained models using L2 and L1
regularization via the penalty parameter and compared different regularization strengths by defining a
range of values for the inverse-regularization parameter C.
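
To see why even this restricted grid is expensive, we can count the model fits it implies. The sketch below mirrors the shape of the two parameter dictionaries with placeholder strings instead of the real tokenizer functions and stop-word list, since only the number of options per parameter matters; the names n_candidates and n_fits are illustrative.

```python
from itertools import product

# Placeholder values: only the number of options per parameter matters here
shared = {'vect__ngram_range': [(1, 1)],
          'vect__stop_words': ['stop', None],
          'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
          'clf__penalty': ['l1', 'l2'],
          'clf__C': [1.0, 10.0, 100.0]}

param_grid = [dict(shared),
              dict(shared, vect__use_idf=[False], vect__norm=[None])]

# Each dictionary spans the Cartesian product of its option lists
n_candidates = sum(len(list(product(*grid.values()))) for grid in param_grid)

# Every candidate is trained once per cross-validation fold
n_folds = 5
n_fits = n_candidates * n_folds

print('%d candidates, %d fits' % (n_candidates, n_fits))
```

Each of these fits rebuilds the feature vectors for its training folds, so adding just one more value to any option list increases the total cost noticeably.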


After the grid search has finished, we can print the best parameter set:


>>> print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
Best parameter set: {'clf__C': 10.0, 'vect__stop_words': None, 'clf__penalty':
'l2', 'vect__tokenizer': <function tokenizer at 0x7f6c704948c8>,
'vect__ngram_range': (1, 1)}


As we can see here, we obtained the best grid search results using the regular tokenizer without
Porter stemming, no stop-word library, and tf-idfs in combination with a logistic regression classifier
that uses L2 regularization with the regularization strength C=10.0.


Using the best model from this grid search, let us print the 5-fold cross-validation accuracy scores on
the training set and the classification accuracy on the test dataset:


>>> print('CV Accuracy: %.3f'
...       % gs_lr_tfidf.best_score_)
CV Accuracy: 0.897
>>> clf = gs_lr_tfidf.best_estimator_
>>> print('Test Accuracy: %.3f'
...       % clf.score(X_test, y_test))
Test Accuracy: 0.899


The results reveal that our machine learning model can predict whether a movie review is positive or
negative with 90 percent accuracy.


Note 


A still very popular classifier for text classification is the Naive Bayes classifier, which gained
popularity in applications of e-mail spam filtering. Naive Bayes classifiers are easy to implement,
computationally efficient, and tend to perform particularly well on relatively small datasets compared
to other algorithms. Although we don't discuss Naive Bayes classifiers in this book, the interested
reader can find my article about Naive Bayes text classification that I made freely available on arXiv (S.
Raschka. Naive Bayes and Text Classification I - Introduction and Theory. Computing Research
Repository (CoRR), abs/1410.5329, 2014. http://arxiv.org/pdf/1410.5329v3.pdf).


Working with bigger data – online algorithms and out-of-core learning


If you executed the code examples in the previous section, you may have noticed that it could be 
computationally quite expensive to construct the feature vectors for the 50,000 movie review dataset 
during grid search. In many real-world applications it is not uncommon to work with even larger 
datasets that may even exceed our computer's memory. Since not everyone has access to 
supercomputer facilities, we will now apply a technique called out-of-core learning that allows us to 
work with such large datasets. 


Back in Chapter 2, Training Machine Learning Algorithms for Classification, we introduced the
concept of stochastic gradient descent, which is an optimization algorithm that updates the model's
weights using one sample at a time. In this section, we will make use of the partial_fit function of
the SGDClassifier in scikit-learn to stream the documents directly from our local drive and train a
logistic regression model using small minibatches of documents.


First, we define a tokenizer function that cleans the unprocessed text data from our
movie_data.csv file that we constructed in the beginning of this chapter and separates it into word
tokens while removing stop words.


>>> import numpy as np
>>> import re
>>> from nltk.corpus import stopwords
>>> stop = stopwords.words('english')
>>> def tokenizer(text):
...     text = re.sub(r'<[^>]*>', '', text)
...     emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
...                            text.lower())
...     text = re.sub(r'[\W]+', ' ', text.lower()) \
...            + ' '.join(emoticons).replace('-', '')
...     tokenized = [w for w in text.split() if w not in stop]
...     return tokenized
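
To see what the tokenizer produces, the following minimal sketch runs the same function on a made-up string. The stop list used here is a small hypothetical stand-in for stopwords.words('english'), so the example runs without the NLTK corpus; the real list is considerably longer.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list (assumption).
stop = {'a', 'is', 'this', 'the', 'and'}

def tokenizer(text):
    # Remove HTML markup.
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;( =) before punctuation is stripped.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    # Replace non-word characters by spaces, then append the emoticons
    # (with the "nose" character '-' removed).
    text = re.sub(r'[\W]+', ' ', text.lower()) \
        + ' '.join(emoticons).replace('-', '')
    # Split on whitespace and drop stop words.
    return [w for w in text.split() if w not in stop]

tokens = tokenizer('</a>This :) is :( a test :-)!')
print(tokens)
```

The markup and punctuation are stripped, the emoticons are moved to the end of the token list, and the stop words are dropped.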


Next, we define a generator function, stream_docs, that reads in and returns one document at a time:


>>> def stream_docs(path):
...     with open(path, 'r') as csv:
...         next(csv)  # skip header
...         for line in csv:
...             text, label = line[:-3], int(line[-2])
...             yield text, label


To verify that our stream_docs function works correctly, let us read in the first document from the
movie_data.csv file, which should return a tuple consisting of the review text as well as the
corresponding class label:


>>> next(stream_docs(path='./movie_data.csv'))
('"In 1974, the teenager Martha Moxley ... ',1)


We will now define a function, get_minibatch, that will take a document stream from the
stream_docs function and return a particular number of documents specified by the size parameter:


>>> def get_minibatch(doc_stream, size):
...     docs, y = [], []
...     try:
...         for _ in range(size):
...             text, label = next(doc_stream)
...             docs.append(text)
...             y.append(label)
...     except StopIteration:
...         return None, None
...     return docs, y
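
The two helpers can be tried out together on a tiny file. The sketch below is only an illustration: the file name toy_reviews.csv and its three rows are made up, and encoding='utf-8' is added as an assumption to make the file handling explicit. It shows how the slicing recovers text and label from each line, and that a minibatch that cannot be filled completely is returned as (None, None).

```python
import os
import tempfile

def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip header
        for line in csv:
            # Each line ends with ',<label>\n', so the last three
            # characters are cut off and the label is the second to last.
            text, label = line[:-3], int(line[-2])
            yield text, label

def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        # The stream ran dry before the batch was full.
        return None, None
    return docs, y

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_reviews.csv')  # hypothetical toy file
    with open(path, 'w', encoding='utf-8') as f:
        f.write('review,sentiment\n'
                '"Great film",1\n'
                '"Dull plot",0\n'
                '"Loved it",1\n')
    doc_stream = stream_docs(path)
    first = get_minibatch(doc_stream, size=2)
    second = get_minibatch(doc_stream, size=2)  # only one document left

print(first)
print(second)
```

Note that the documents of an incomplete final minibatch are discarded, so the number of documents should be a multiple of the minibatch size if every document is meant to be used.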


Unfortunately, we can't use the CountVectorizer for out-of-core learning since it requires holding
the complete vocabulary in memory. Also, the TfidfVectorizer needs to keep all the feature
vectors of the training dataset in memory to calculate the inverse document frequencies. However,
another useful vectorizer for text processing implemented in scikit-learn is HashingVectorizer.
HashingVectorizer is data-independent and makes use of the hashing trick via the 32-bit
MurmurHash3 algorithm by Austin Appleby (https://sites.google.com/site/murmurhash/).


>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> from sklearn.linear_model import SGDClassifier
>>> vect = HashingVectorizer(decode_error='ignore',
...                          n_features=2**21,
...                          preprocessor=None,
...                          tokenizer=tokenizer)
>>> clf = SGDClassifier(loss='log', random_state=1, n_iter=1)
>>> doc_stream = stream_docs(path='./movie_data.csv')


Using the preceding code, we initialized HashingVectorizer with our tokenizer function and set
the number of features to 2^21 (n_features=2**21). Furthermore, we reinitialized a logistic
regression classifier by setting the loss parameter of the SGDClassifier to log. Note that, by
choosing a large number of features in the HashingVectorizer, we reduce the chance of hash
collisions, but we also increase the number of coefficients in our logistic regression model.
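
The idea behind the hashing trick can be sketched in a few lines of plain Python. This is not how HashingVectorizer is implemented: the helper name hash_features is hypothetical, and zlib.crc32 is used only as a readily available stand-in for MurmurHash3. The point is that a token's column index is computed from the token itself, so no vocabulary has to be stored.

```python
import zlib
from collections import Counter

def hash_features(tokens, n_features=2**21):
    """Map tokens to sparse counts, indexed by a hash of each token.

    zlib.crc32 is an assumption standing in for MurmurHash3.
    """
    counts = Counter()
    for tok in tokens:
        idx = zlib.crc32(tok.encode('utf-8')) % n_features
        counts[idx] += 1
    return dict(counts)

vec_a = hash_features(['great', 'movie', 'great'])
vec_b = hash_features(['great', 'movie', 'great'])
print(vec_a)
```

Because the mapping depends only on the token, the same document always yields the same vector without a fitting step. Two different tokens can land in the same bucket, which is why a larger n_features lowers the collision rate at the cost of more model coefficients.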


Now comes the really interesting part. Having set up all the complementary functions, we can now 
start the out-of-core learning using the following code: 


>>> import pyprind
>>> pbar = pyprind.ProgBar(45)
>>> classes = np.array([0, 1])
>>> for _ in range(45):
...     X_train, y_train = get_minibatch(doc_stream, size=1000)
...     if not X_train:
...         break
...     X_train = vect.transform(X_train)
...     clf.partial_fit(X_train, y_train, classes=classes)
...     pbar.update()
0%                          100%
[##############################] | ETA[sec]: 0.000
Total time elapsed: 50.063 sec


Again, we made use of the PyPrind package in order to estimate the progress of our learning
algorithm. We initialized the progress bar object with 45 iterations and, in the following for loop, we
iterated over 45 minibatches of documents, where each minibatch consists of 1,000 documents.
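
What a partial_fit call does conceptually can be illustrated with a toy logistic regression that is updated one minibatch at a time. The class OnlineLogReg below is a hypothetical sketch written for this illustration, not scikit-learn's SGDClassifier; the documents are made-up sparse vectors given as dictionaries that map a feature index to a value.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class OnlineLogReg:
    """Toy logistic regression trained by stochastic gradient descent."""

    def __init__(self, eta=0.5):
        self.eta = eta   # learning rate
        self.w = {}      # sparse weights: feature index -> weight
        self.b = 0.0     # bias unit

    def _net_input(self, x):
        return self.b + sum(self.w.get(i, 0.0) * v for i, v in x.items())

    def partial_fit(self, X, y):
        # One weight update per sample; the weights learned from
        # earlier minibatches are kept and refined.
        for x, target in zip(X, y):
            error = target - sigmoid(self._net_input(x))
            for i, v in x.items():
                self.w[i] = self.w.get(i, 0.0) + self.eta * error * v
            self.b += self.eta * error
        return self

    def predict(self, X):
        return [1 if self._net_input(x) >= 0.0 else 0 for x in X]

# Two hypothetical minibatches of already vectorized documents.
batches = [
    ([{0: 1.0}, {1: 1.0}], [1, 0]),
    ([{0: 1.0, 2: 1.0}, {1: 1.0, 2: 1.0}], [1, 0]),
]

model = OnlineLogReg()
for _ in range(20):
    for X_batch, y_batch in batches:
        model.partial_fit(X_batch, y_batch)

predictions = model.predict([{0: 1.0}, {1: 1.0}])
print(predictions)
```

Only one minibatch has to be in memory at any time, which is what makes this style of training suitable for datasets that do not fit into memory.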


Having completed the incremental learning process, we will use the last 5,000 documents to evaluate 
the performance of our model: 


>>> X_test, y_test = get_minibatch(doc_stream, size=5000)
>>> X_test = vect.transform(X_test)
>>> print('Accuracy: %.3f' % clf.score(X_test, y_test))
Accuracy: 0.868


As we can see, the accuracy of the model is 87 percent, slightly below the accuracy that we achieved
in the previous section using the grid search for hyperparameter tuning. However, out-of-core
learning is very memory-efficient and took less than a minute to complete. Finally, we can use the last
5,000 documents to update our model:


>>> clf = clf.partial_fit(X_test, y_test)


If you are planning to continue directly with Chapter 9, Embedding a Machine Learning Model into
a Web Application, I recommend that you keep the current Python session open. In the next chapter, we
will use the model that we just trained to learn how to save it to disk for later use and embed it into a web
application.


Note 


Although the bag-of-words model is still the most commonly used model for text classification, it
does not consider sentence structure and grammar. A popular extension of the bag-of-words model is
Latent Dirichlet allocation, which is a topic model that considers the latent semantics of words (D.
M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning
Research, 3:993-1022, 2003).


A more modern alternative to the bag-of-words model is word2vec, an algorithm that Google
released in 2013 (T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient Estimation of Word
Representations in Vector Space. arXiv preprint arXiv:1301.3781, 2013). The word2vec algorithm
is an unsupervised learning algorithm based on neural networks that attempts to automatically learn
the relationship between words. The idea behind word2vec is to put words that have similar
meanings into similar clusters; via clever vector-spacing, the model can reproduce certain words
using simple vector math, for example, king - man + woman = queen.


The original C-implementation, with useful links to the relevant papers and alternative
implementations, can be found at https://code.google.com/p/word2vec/.


Summary


In this chapter, we learned how to use machine learning algorithms to classify text documents based 
on their polarity, which is a basic task in sentiment analysis in the field of natural language 
processing. Not only did we learn how to encode a document as a feature vector using the bag-of- 
words model, but we also learned how to weight the term frequency by relevance using term 
frequency-inverse document frequency. 


Working with text data can be computationally quite expensive due to the large feature vectors that are
created during this process; in the last section, we learned how to utilize out-of-core or incremental
learning to train a machine learning algorithm without loading the whole dataset into a computer's
memory.


In the next chapter, we will use our document classifier and learn how to embed it into a web
application.


Chapter 9. Embedding a Machine Learning Model into a Web Application


In the previous chapters, you learned about the many different machine learning concepts and 
algorithms that can help us with better and more efficient decision-making. However, machine 
learning techniques are not limited to offline applications and analyses, and they can be the predictive 
engine of your web services. For example, popular and useful applications of machine learning 
models in web applications include spam detection in submission forms, search engines, 
recommendation systems for media or shopping portals, and many more. 


In this chapter, you will learn how to embed a machine learning model into a web application that can 
not only classify but also learn from data in real-time. The topics that we will cover are as follows: 


Saving the current state of a trained machine learning model 

Using SQLite databases for data storage 

Developing a web application using the popular Flask web framework 
Deploying a machine learning application to a public web server


Serializing fitted scikit-learn estimators


Training a machine learning model can be computationally quite expensive, as we have seen in
Chapter 8, Applying Machine Learning to Sentiment Analysis. Surely, we don't want to retrain our
model every time we close our Python interpreter and want to make a new prediction or reload our
web application. One option for model persistence is Python's built-in pickle module
(https://docs.python.org/3.4/library/pickle.html), which allows us to serialize and de-serialize Python
object structures to compact byte code, so that we can save our classifier in its current state and
reload it if we want to classify new samples without needing to learn the model from the training data
all over again. Before you execute the following code, please make sure that you have trained the
out-of-core logistic regression model from the last section of Chapter 8, Applying Machine Learning to
Sentiment Analysis, and have it ready in your current Python session:


>>> import pickle
>>> import os
>>> dest = os.path.join('movieclassifier', 'pkl_objects')
>>> if not os.path.exists(dest):
...     os.makedirs(dest)
>>> pickle.dump(stop,
...     open(os.path.join(dest, 'stopwords.pkl'), 'wb'),
...     protocol=4)
>>> pickle.dump(clf,
...     open(os.path.join(dest, 'classifier.pkl'), 'wb'),
...     protocol=4)


Using the preceding code, we created a movieclassifier directory where we will later store the
files and data for our web application. Within this movieclassifier directory, we created a
pkl_objects subdirectory to save the serialized Python objects to our local drive. Via pickle's dump
method, we then serialized the trained logistic regression model as well as the stop word set from the
NLTK library so that we don't have to install the NLTK vocabulary on our server. The dump method
takes as its first argument the object that we want to pickle, and for the second argument we provided
an open file object that the Python object will be written to. Via the wb argument inside the open
function, we opened the file in binary mode for pickle, and we set protocol=4 to choose the latest
and most efficient pickle protocol that has been added to Python 3.4. (If you have problems using
protocol 4, please check whether you are using the latest Python 3 version. Alternatively, you may
consider choosing a lower protocol number.)
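
The dump and load round trip can be tried in isolation with a small stand-in object. In the sketch below, the short stop list and the temporary directory are assumptions made so that the example runs anywhere; it also uses with blocks so that the file objects are closed explicitly.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the NLTK stop-word list.
stop = ['a', 'the', 'and']

with tempfile.TemporaryDirectory() as tmp:
    dest = os.path.join(tmp, 'movieclassifier', 'pkl_objects')
    os.makedirs(dest, exist_ok=True)
    path = os.path.join(dest, 'stopwords.pkl')

    # Serialize: 'wb' opens the file in binary mode for writing.
    with open(path, 'wb') as f:
        pickle.dump(stop, f, protocol=4)

    # De-serialize: 'rb' opens the file in binary mode for reading.
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored)
```

The restored object is equal to the original one, which is the property we rely on when reloading the classifier in a fresh Python session.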


Note 


Our logistic regression model contains several NumPy arrays, such as the weight vector, and a more
efficient way to serialize NumPy arrays is to use the alternative joblib library. To ensure
compatibility with the server environment that we will use in later sections, we will use the standard
pickle approach. If you are interested, you can find more information about joblib at
https://pypi.python.org/pypi/joblib.


We don't need to pickle the HashingVectorizer, since it does not need to be fitted. Instead, we can
create a new Python script file, from which we can import the vectorizer into our current Python
session. Now, copy the following code and save it as vectorizer.py in the movieclassifier
directory:


from sklearn.feature_extraction.text import HashingVectorizer
import re
import os
import pickle

cur_dir = os.path.dirname(__file__)
stop = pickle.load(open(
                os.path.join(cur_dir,
                'pkl_objects',
                'stopwords.pkl'), 'rb'))

def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower()) \
                   + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         tokenizer=tokenizer)


After we have pickled the Python objects and created the vectorizer.py file, it would now be a
good idea to restart our Python interpreter or IPython Notebook kernel to test if we can deserialize the
objects without error. However, please note that unpickling data from an untrusted source can be a
potential security risk since the pickle module is not secure against malicious code. From your
terminal, navigate to the movieclassifier directory, start a new Python session and execute the
following code to verify that you can import the vectorizer and unpickle the classifier:


>>> import pickle
>>> import re
>>> import os
>>> from vectorizer import vect
>>> clf = pickle.load(open(
...        os.path.join('pkl_objects',
...        'classifier.pkl'), 'rb'))


After we have successfully loaded the vectorizer and unpickled the classifier, we can now use these 
objects to pre-process document samples and make predictions about their sentiment: 


>>> import numpy as np
>>> label = {0:'negative', 1:'positive'}
>>> example = ['I love this movie']
>>> X = vect.transform(example)
>>> print('Prediction: %s\nProbability: %.2f%%' %\
...       (label[clf.predict(X)[0]],
...        np.max(clf.predict_proba(X))*100))
Prediction: positive
Probability: 91.56%


Since our classifier returns the class labels as integers, we defined a simple Python dictionary to map
those integers to their sentiment. We then used the HashingVectorizer to transform the simple
example document into a word vector X. Finally, we used the predict method of the logistic
regression classifier to predict the class label as well as the predict_proba method to return the
corresponding probability of our prediction. Note that the predict_proba method call returns an
array with a probability value for each unique class label. Since the class label with the largest
probability corresponds to the class label that is returned by the predict call, we used the np.max
function to return the probability of the predicted class.
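
The relationship between the probability array and the reported prediction can be spelled out without a trained model. In this sketch, proba_row is a hypothetical stand-in for one row of the predict_proba output, with values chosen to mirror the example above, and plain Python replaces NumPy.

```python
label = {0: 'negative', 1: 'positive'}

# Hypothetical probabilities for class 0 and class 1 (assumption).
proba_row = [0.0844, 0.9156]

# The predicted class is the index of the largest probability ...
predicted_class = max(range(len(proba_row)), key=proba_row.__getitem__)

# ... and the largest probability itself is what gets reported.
message = 'Prediction: %s\nProbability: %.2f%%' % (
    label[predicted_class], max(proba_row) * 100)
print(message)
```

Taking the maximum of the row therefore always yields the probability that belongs to the predicted label.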

Setting up a SQLite database for data storage


In this section, we will set up a simple SQLite database to collect optional feedback about the 
predictions from users of the web application. We can use this feedback to update our classification 
model. SQLite is an open source SQL database engine that doesn't require a separate server to 
operate, which makes it ideal for smaller projects and simple web applications. Essentially, a SQLite 
database can be understood as a single, self-contained database file that allows us to directly access 
storage files. Furthermore, SQLite doesn't require any system-specific configuration and is supported 
by all common operating systems. It has gained a reputation for being very reliable as it is used by
popular companies, such as Google, Mozilla, Adobe, Apple, Microsoft, and many more. If you want 
to learn more about SQLite, I recommend you visit the official website at http://www.sqlite.org. 


Fortunately, following Python's batteries included philosophy, there is already an API in the Python
standard library, sqlite3, which allows us to work with SQLite databases (for more information about
sqlite3, please visit https://docs.python.org/3.4/library/sqlite3.html).


By executing the following code, we will create a new SQLite database inside the movieclassifier 
directory and store two example movie reviews: 


>>> import sqlite3
>>> import os
>>> conn = sqlite3.connect('reviews.sqlite')
>>> c = conn.cursor()
>>> c.execute('CREATE TABLE review_db'\
...           ' (review TEXT, sentiment INTEGER, date TEXT)')
>>> example1 = 'I love this movie'
>>> c.execute("INSERT INTO review_db"\
...           " (review, sentiment, date) VALUES"\
...           " (?, ?, DATETIME('now'))", (example1, 1))
>>> example2 = 'I disliked this movie'
>>> c.execute("INSERT INTO review_db"\
...           " (review, sentiment, date) VALUES"\
...           " (?, ?, DATETIME('now'))", (example2, 0))
>>> conn.commit()
>>> conn.close()


Following the preceding code example, we created a connection (conn) to an SQLite database file by
calling sqlite3's connect method, which created the new database file reviews.sqlite in the
movieclassifier directory if it didn't already exist. Please note that SQLite doesn't implement a
replace function for existing tables; you need to delete the database file manually from your file
browser if you want to execute the code a second time. Next, we created a cursor via the cursor
method, which allows us to traverse over the database records using the powerful SQL syntax. Via the
first execute call, we then created a new database table, review_db. We used this to store and
access database entries. Along with review_db, we also created three columns in this database table:
review, sentiment, and date. We used these to store two example movie reviews and respective
class labels (sentiments). Using the SQL command DATETIME('now'), we also added date- and
timestamps to our entries. In addition to the timestamps, we used the question mark symbols (?) to
pass the movie review texts (example1 and example2) and the corresponding class labels (1 and 0)
as positional arguments to the execute method as members of a tuple. Lastly, we called the commit
method to save the changes that we made to the database and closed the connection via the close
method.
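
For experimenting with these calls without creating a file, a variation of the example can be run against an in-memory database. This is a sketch under assumptions that differ from the code above: ':memory:' replaces the reviews.sqlite file, CREATE TABLE IF NOT EXISTS is used so that the statement can be executed repeatedly, and executemany inserts both rows in one call.

```python
import sqlite3

# ':memory:' creates a temporary database that lives only in RAM.
conn = sqlite3.connect(':memory:')
c = conn.cursor()

# IF NOT EXISTS avoids an error when the table is already there.
c.execute('CREATE TABLE IF NOT EXISTS review_db'
          ' (review TEXT, sentiment INTEGER, date TEXT)')

rows = [('I love this movie', 1),
        ('I disliked this movie', 0)]

# Each tuple fills the two ? placeholders; the date comes from SQLite.
c.executemany("INSERT INTO review_db"
              " (review, sentiment, date) VALUES"
              " (?, ?, DATETIME('now'))", rows)
conn.commit()

c.execute('SELECT review, sentiment FROM review_db ORDER BY sentiment')
stored = c.fetchall()
conn.close()
print(stored)
```

Passing values through the ? placeholders, rather than formatting them into the SQL string, lets sqlite3 handle quoting of the review text.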


To check if the entries have been stored in the database table correctly, we will now reopen the
connection to the database and use the SQL SELECT command to fetch all rows in the database table
that have been committed between the beginning of the year 2015 and today:


>>> conn = sqlite3.connect('reviews.sqlite')
>>> c = conn.cursor()
>>> c.execute("SELECT * FROM review_db WHERE date"\
...   " BETWEEN '2015-01-01 00:00:00' AND DATETIME('now')")
>>> results = c.fetchall()
>>> conn.close()
>>> print(results)
[('I love this movie', 1, '2015-06-02 16:02:12'), ('I disliked this movie', 0,
'2015-06-02 16:02:12')]
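
The date filter works because the timestamps are stored as text in the form YYYY-MM-DD HH:MM:SS, which sorts chronologically when compared as strings. The following sketch illustrates this with an in-memory database and two made-up rows with fixed dates (both are assumptions for the sake of the example):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE review_db'
          ' (review TEXT, sentiment INTEGER, date TEXT)')

# Hypothetical rows with explicit timestamps on both sides of 2015-01-01.
c.executemany('INSERT INTO review_db (review, sentiment, date)'
              ' VALUES (?, ?, ?)',
              [('old review', 0, '2014-12-31 23:59:59'),
               ('new review', 1, '2015-06-02 16:02:12')])

c.execute("SELECT * FROM review_db WHERE date"
          " BETWEEN '2015-01-01 00:00:00' AND DATETIME('now')")
results = c.fetchall()
conn.close()
print(results)
```

Only the row dated after the start of 2015 is returned; the older row falls outside the BETWEEN range.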


Alternatively, we could also use the free Firefox browser plugin SQLite Manager (available at
https://addons.mozilla.org/en-US/firefox/addon/sqlite-manager/), which offers a nice GUI
for working with SQLite databases as shown in the following screenshot:


[Screenshot: SQLite Manager displaying the review_db table of reviews.sqlite with the two example reviews]


Developing a web application with Flask 


After we have prepared the code to classify movie reviews in the previous subsection, let's discuss
the basics of the Flask web framework to develop our web application. After Armin Ronacher's
initial release of Flask in 2010, the framework has gained huge popularity over the years and
examples of popular applications that make use of Flask include LinkedIn and Pinterest. Since Flask
is written in Python, it provides us Python programmers with a convenient interface for embedding
existing Python code such as our movie classifier.


Note 


Flask is also known as a microframework, which means that its core is kept lean and simple but can be
easily extended with other libraries. Although the learning curve of the lightweight Flask API is not
nearly as steep as those of other popular Python web frameworks, such as Django, I encourage you to
take a look at the official Flask documentation at http://flask.pocoo.org/docs/0.10/ to learn more
about its functionality.


If the Flask library is not already installed in your current Python environment, you can simply install 
it via pip from your terminal (at the time of writing, the latest stable release was Version 0.10.1): 


pip install flask


Our first Flask web application


In this subsection, we will develop a very simple web application to become more familiar with the 
Flask API before we implement our movie classifier. First, we create a directory tree: 


1st_flask_app_1/
    app.py
    templates/
        first_app.html


The app.py file will contain the main code that will be executed by the Python interpreter to run the
Flask web application. The templates directory is the directory in which Flask will look for static
HTML files for rendering in the web browser. Let's now take a look at the contents of app.py:


from flask import Flask, render_template

app = Flask(__name__)

@app.route('/')
def index():
    return render_template('first_app.html')

if __name__ == '__main__':
    app.run()

In this case, we run our application as a single module, thus we initialized a new Flask instance with
the argument __name__ to let Flask know that it can find the HTML template folder (templates) in
the same directory where it is located. Next, we used the route decorator (@app.route('/')) to
specify the URL that should trigger the execution of the index function. Here, our index function
simply renders the HTML file first_app.html, which is located in the templates folder. Lastly,
we used the run function to only run the application on the server when this script is directly executed
by the Python interpreter, which we ensured using the if statement with __name__ == '__main__'.
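
If decorators that take an argument are unfamiliar, the following toy registry may help to build an intuition for what a route decorator does. It is emphatically not Flask's implementation: the names routes and route are hypothetical, and the sketch only records which function belongs to which URL.

```python
# Toy URL registry (assumption: for illustration only, not Flask code).
routes = {}

def route(url):
    """Return a decorator that registers a function under the given URL."""
    def register(func):
        routes[url] = func
        return func  # the function itself is left unchanged
    return register

@route('/')
def index():
    return 'Hi, this is my first Flask web app!'

# A web framework would look up the requested URL and call the function.
response = routes['/']()

if __name__ == '__main__':
    # Runs only when the file is executed directly, not when imported.
    print(response)
```

Flask's decorator additionally handles HTTP methods, URL parameters, and response objects, but the registration idea is comparable.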


Now, let's take a look at the contents of the first_app.html file. If you are not familiar with the
HTML syntax yet, I recommend you visit http://www.w3schools.com/html/default.asp for useful
tutorials for learning the basics of HTML.


<!doctype html>
<html>
  <head>
    <title>First app</title>
  </head>
  <body>
    <div>Hi, this is my first Flask web app!</div>
  </body>
</html>


Here, we have simply filled an empty HTML template file with a div element (a block level element)
that contains the sentence: Hi, this is my first Flask web app!. Conveniently, Flask allows
us to run our apps locally, which is useful for developing and testing web applications before we
deploy them on a public web server. Now, let's start our web application by executing the command
from the terminal inside the 1st_flask_app_1 directory:


python3 app.py 


We should now see a line such as the following displayed in the terminal:

* Running on http://127.0.0.1:5000/

This line contains the address of our local server. We can now enter this address in our web browser
to see the web application in action. If everything has executed correctly, we should now see a simple
website with the content: Hi, this is my first Flask web app!.


Form validation and rendering


In this subsection, we will extend our simple Flask web application with HTML form elements to
learn how to collect data from a user using the WTForms library
(https://wtforms.readthedocs.org/en/latest/), which can be installed via pip:


pip install wtforms


This web app will prompt a user to type in his or her name into a text field, as shown in the following
screenshot:


[Screenshot: a page at 127.0.0.1 asking "What's your name?" with a text field containing "Sebastian" and a Say Hello button]


After the submission button (Say Hello) has been clicked and the form is validated, a new HTML
page will be rendered to display the user's name.


[Screenshot: the page at 127.0.0.1:5000/hello displaying "Hello Sebastian"]


The new directory structure that we need to set up for this application looks like this:


1st_flask_app_2/
    app.py
    static/
        style.css
    templates/
        _formhelpers.html
        first_app.html
        hello.html


The following are the contents of our modified app.py file:


from flask import Flask, render_template, request
from wtforms import Form, TextAreaField, validators

app = Flask(__name__)

class HelloForm(Form):
    sayhello = TextAreaField('', [validators.DataRequired()])

@app.route('/')
def index():
    form = HelloForm(request.form)
    return render_template('first_app.html', form=form)

@app.route('/hello', methods=['POST'])
def hello():
    form = HelloForm(request.form)
    if request.method == 'POST' and form.validate():
        name = request.form['sayhello']
        return render_template('hello.html', name=name)
    return render_template('first_app.html', form=form)

if __name__ == '__main__':
    app.run(debug=True)


Using wtforms, we extended the index function with a text field that we will embed in our start page
using the TextAreaField class; the attached DataRequired validator checks whether a user has
provided input text. Furthermore, we defined a new function, hello, which will render an HTML page
hello.html if the form has been validated. Here, we used the POST method to transport the form data
to the server in the message body. Finally, by setting the argument debug=True inside the app.run
method, we further activated Flask's debugger. This is a useful feature for developing new web
applications.


Now, we will implement a generic macro in the file _formhelpers.html via the Jinja2 templating
engine, which we will later import in our first_app.html file to render the text field:


{% macro render_field(field) %}
  <dt>{{ field.label }}
  <dd>{{ field(**kwargs)|safe }}
  {% if field.errors %}
    <ul class=errors>
    {% for error in field.errors %}
      <li>{{ error }}</li>
    {% endfor %}
    </ul>
  {% endif %}
  </dd>
{% endmacro %}


An in-depth discussion about the Jinja2 templating language is beyond the scope of this book.
However, you can find comprehensive documentation of the Jinja2 syntax at http://jinja.pocoo.org.


Next, we set up a simple Cascading Style Sheets (CSS) file, style.css, to demonstrate how the 
look and feel of HTML documents can be modified. We have to save the following CSS file, which 
will simply double the font size of our HTML body elements, in a subdirectory called static, which 
is the default directory where Flask looks for static files such as CSS. The code is as follows: 


body {
    font-size: 2em;
}


The following are the contents of the modified first_app.html file that will now render a text form
where a user can enter a name:


<!doctype html>
<html>
  <head>
    <title>First app</title>
    <link rel="stylesheet" href="{{ url_for('static', filename='style.css') }}">
  </head>
  <body>

{% from "_formhelpers.html" import render_field %}

<div>What's your name?</div>
<form method=post action="/hello">
  <dl>
      {{ render_field(form.sayhello) }}
  </dl>
  <input type=submit value='Say Hello' name='submit_btn'>
</form>

  </body>
</html>


In the header section of first_app.html, we loaded the CSS file. It should now alter the size of all
text elements in the HTML body. In the HTML body section, we imported the form macro from
_formhelpers.html and we rendered the sayhello form that we specified in the app.py file.
Furthermore, we added a button to the same form element so that a user can submit the text field entry.


Lastly, we create a hello.html file that will be rendered via the line return
render_template('hello.html', name=name) inside the hello function, which we defined in the
app.py script to display the text that a user submitted via the text field. The code is as follows:


<!doctype html>
<html>
  <head>
    <title>First app</title>
    <link rel="stylesheet" href="{{ url_for('static', filename='style.css') }}">
  </head>
  <body>
    <div>Hello {{ name }}</div>
  </body>
</html>
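
Because the name placeholder is filled with text typed in by a user, it matters that Flask configures Jinja2 to autoescape values in templates whose filenames end in .html. The effect can be imitated with the standard library; the helper render_hello below is a hypothetical stand-in for rendering hello.html and uses html.escape instead of Jinja2.

```python
import html

def render_hello(name):
    # Escape the user input before placing it into the markup,
    # imitating what autoescaping does for {{ name }}.
    return '<div>Hello {}</div>'.format(html.escape(name))

safe_page = render_hello('Sebastian')
escaped_page = render_hello('<b>Sebastian</b>')
print(safe_page)
print(escaped_page)
```

Ordinary names pass through unchanged, whereas markup entered into the text field is displayed literally instead of being interpreted by the browser.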


Having set up our modified Flask web application, we can run it locally by executing the following
command from the app's main directory and we can view the result in our web browser at
http://127.0.0.1:5000/:


python3 app.py


Note


If you are new to web development, some of those concepts may seem very complicated at first sight.
In that case, I encourage you to simply set up the preceding files in a directory on your hard drive and
examine them closely. You will see that the Flask web framework is actually pretty straightforward
and much simpler than it might initially appear! Also, for more help, don't forget to look at the
excellent Flask documentation and examples at http://flask.pocoo.org/docs/0.10/.


Turning the movie classifier into a web 
application 


Now that we are somewhat familiar with the basics of Flask web development, let's advance to the 
next step and implement our movie classifier into a web application. In this section, we will develop 
a web application that will first prompt a user to enter a movie review, as shown in the following 
screenshot: 


[Screenshot: the start page at raschkas.pythonanywhere.com with the prompt "Please enter your movie review:", a text field containing "I love this movie!", and a Submit review button]


After the review has been submitted, the user will see a new page that shows the predicted class 
label and the probability of the prediction. Furthermore, the user will be able to provide feedback 
about this prediction by clicking on the Correct or Incorrect button, as shown in the following 
screenshot: 


[Screenshot: the results page, showing the submitted review ("I love this movie!"), the prediction
"This movie review is positive (probability: 90.86%).", the Correct and Incorrect feedback buttons,
and a Submit another review button]


If a user clicks on either the Correct or Incorrect button, our classification model will be updated
with respect to the user's feedback. Furthermore, we will also store the movie review text provided
by the user as well as the suggested class label, which can be inferred from the button click, in a
SQLite database for future reference. The third page that the user will see after clicking on one of the
feedback buttons is a simple thank you screen with a Submit another review button that redirects the
user back to the start page. This is shown in the following screenshot:


[Screenshot: the thank you page, showing the message "Thank you for your feedback!" and a
Submit another review button]





Before we take a closer look at the code implementation of this web application, I encourage you to 
take a look at the live demo that I uploaded at http://raschkas.pythonanywhere.com to get a better 
understanding of what we are trying to accomplish in this section. 


To start with the big picture, let's take a look at the directory tree that we are going to create for this
movie classification app, which is shown here:


    app.py
    pkl_objects/
        classifier.pkl
        stopwords.pkl
    reviews.sqlite
    static/
        style.css
    templates/
        _formhelpers.html
        results.html
        reviewform.html
        thanks.html
    vectorizer.py





In the previous section of this chapter, we already created the vectorizer.py file, the SQLite
database reviews.sqlite, and the pkl_objects subdirectory with the pickled Python objects.


The app.py file in the main directory is the Python script that contains our Flask code, and we will
use the reviews.sqlite database file (which we created earlier in this chapter) to store the movie
reviews that are being submitted to our web app. The templates subdirectory contains the HTML 
templates that will be rendered by Flask and displayed in the browser, and the static subdirectory 
will contain a simple CSS file to adjust the look of the rendered HTML code. 


Since the app.py file is rather long, we will conquer it in two steps. The first section of app.py
imports the Python modules and objects that we are going to need, as well as the code to unpickle and
set up our classification model:


from flask import Flask, render_template, request
from wtforms import Form, TextAreaField, validators
import pickle
import sqlite3
import os
import numpy as np

# import HashingVectorizer from local dir
from vectorizer import vect

app = Flask(__name__)

######## Preparing the Classifier
cur_dir = os.path.dirname(__file__)
clf = pickle.load(open(os.path.join(cur_dir,
                  'pkl_objects/classifier.pkl'), 'rb'))
db = os.path.join(cur_dir, 'reviews.sqlite')

def classify(document):
    label = {0: 'negative', 1: 'positive'}
    X = vect.transform([document])
    y = clf.predict(X)[0]
    proba = np.max(clf.predict_proba(X))
    return label[y], proba

def train(document, y):
    X = vect.transform([document])
    clf.partial_fit(X, [y])

def sqlite_entry(path, document, y):
    conn = sqlite3.connect(path)
    c = conn.cursor()
    c.execute("INSERT INTO review_db (review, sentiment, date)"\
    " VALUES (?, ?, DATETIME('now'))", (document, y))
    conn.commit()
    conn.close()


This first part of the app.py script should look very familiar to us by now. We simply imported the
HashingVectorizer and unpickled the logistic regression classifier. Next, we defined a classify
function to return the predicted class label as well as the corresponding probability prediction of a
given text document. The train function can be used to update the classifier given that a document
and a class label are provided. Using the sqlite_entry function, we can store a submitted movie
review in our SQLite database along with its class label and timestamp for our personal records.
Note that the clf object will be reset to its original, pickled state if we restart the web application.
At the end of this chapter, you will learn how to use the data that we collect in the SQLite database to
update the classifier permanently.
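To make the storage step more concrete, the following is a minimal sketch of the sqlite_entry logic run against a throwaway in-memory database. The table layout used here (review TEXT, sentiment INTEGER, date TEXT) is an assumption about the review_db table created earlier in the chapter, and this variant takes an open connection instead of a file path only so that the in-memory database can be reused for the check at the end.

```python
import sqlite3

def sqlite_entry(conn, document, y):
    """Store one review, its class label, and the current timestamp."""
    c = conn.cursor()
    # The ? placeholders let sqlite3 handle quoting of the user-supplied text.
    c.execute("INSERT INTO review_db (review, sentiment, date)"
              " VALUES (?, ?, DATETIME('now'))", (document, y))
    conn.commit()

# Assumed schema; the real table was created earlier in the chapter.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE review_db '
             '(review TEXT, sentiment INTEGER, date TEXT)')

sqlite_entry(conn, "I love this movie!", 1)
rows = conn.execute('SELECT review, sentiment FROM review_db').fetchall()
print(rows)
```

Passing the values as a separate tuple, rather than formatting them into the SQL string, matters here because the review text comes straight from a web form.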


The concepts in the second part of the app.py script should also look quite familiar to us: 


app = Flask(__name__)
class ReviewForm(Form):
    moviereview = TextAreaField('',
                                [validators.DataRequired(),
                                 validators.length(min=15)])

@app.route('/')
def index():
    form = ReviewForm(request.form)
    return render_template('reviewform.html', form=form)

@app.route('/results', methods=['POST'])
def results():
    form = ReviewForm(request.form)
    if request.method == 'POST' and form.validate():
        review = request.form['moviereview']
        y, proba = classify(review)
        return render_template('results.html',
                               content=review,
                               prediction=y,
                               probability=round(proba*100, 2))
    return render_template('reviewform.html', form=form)

@app.route('/thanks', methods=['POST'])
def feedback():
    feedback = request.form['feedback_button']
    review = request.form['review']
    prediction = request.form['prediction']

    inv_label = {'negative': 0, 'positive': 1}
    y = inv_label[prediction]
    if feedback == 'Incorrect':
        y = int(not(y))
    train(review, y)
    sqlite_entry(db, review, y)
    return render_template('thanks.html')

if __name__ == '__main__':
    app.run(debug=True)


We defined a ReviewForm class that instantiates a TextAreaField, which will be rendered in the
reviewform.html template file (the landing page of our web app). This, in turn, is rendered by the
index function. With the validators.length(min=15) parameter, we require the user to enter a
review that contains at least 15 characters. Inside the results function, we fetch the contents of the
submitted web form and pass it on to our classifier to predict the sentiment of the movie review,
which will then be displayed in the rendered results.html template.


The feedback function may look a little bit complicated at first glance. It essentially fetches the
predicted class label from the results.html template if a user clicked on the Correct or Incorrect
feedback button, and transforms the predicted sentiment back into an integer class label that will be
used to update the classifier via the train function, which we implemented in the first section of the
app.py script. Also, a new entry to the SQLite database will be made via the sqlite_entry function
if feedback was provided, and eventually the thanks.html template will be rendered to thank the
user for the feedback.
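The label handling inside the feedback function can be isolated into a few lines. The sketch below uses a hypothetical helper name, resolve_label, which does not appear in app.py; it only illustrates how the predicted sentiment string and the clicked button together determine the integer label that is passed on to train.

```python
def resolve_label(prediction, feedback):
    """Return the integer class label implied by the user's button click."""
    inv_label = {'negative': 0, 'positive': 1}
    y = inv_label[prediction]
    if feedback == 'Incorrect':
        # The user disagreed, so the true label is the opposite class.
        y = int(not y)
    return y

for prediction in ('negative', 'positive'):
    for feedback in ('Correct', 'Incorrect'):
        print(prediction, feedback, '->', resolve_label(prediction, feedback))
```

Because there are only two classes, flipping the label with int(not y) is sufficient; a classifier with more than two classes would need the user to pick the correct class explicitly.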


Next, let's take a look at the reviewform.html template, which constitutes the starting page of our
application:


<!doctype html>
<html>
  <head>
    <title>Movie Classification</title>
  </head>
  <body>

<h2>Please enter your movie review:</h2>

{% from "_formhelpers.html" import render_field %}

<form method=post action="/results">
  <dl>
    {{ render_field(form.moviereview, cols='30', rows='10') }}
  </dl>
  <div>
    <input type=submit value='Submit review' name='submit_btn'>
  </div>
</form>

  </body>
</html>


Here, we simply imported the same _formhelpers.html template that we defined in the Form
validation and rendering section earlier in this chapter. The render_field function of this macro is
used to render a TextAreaField where a user can provide a movie review and submit it via the
Submit review button displayed at the bottom of the page. This TextAreaField is 30 columns wide
and 10 rows tall.


Our next template, results.html, looks a little bit more interesting:


<!doctype html>
<html>
  <head>
    <title>Movie Classification</title>
    <link rel="stylesheet" href="{{ url_for('static', filename='style.css') }}">
  </head>
  <body>

<h3>Your movie review:</h3>
<div>{{ content }}</div>

<h3>Prediction:</h3>
<div>This movie review is <strong>{{ prediction }}</strong>
(probability: {{ probability }}%).</div>

<div id='button'>
  <form action="/thanks" method="post">
    <input type=submit value='Correct' name='feedback_button'>
    <input type=submit value='Incorrect' name='feedback_button'>
    <input type=hidden value='{{ prediction }}' name='prediction'>
    <input type=hidden value='{{ content }}' name='review'>
  </form>
</div>

<div id='button'>
  <form action="/">
    <input type=submit value='Submit another review'>
  </form>
</div>

  </body>
</html>


First, we inserted the submitted review as well as the results of the prediction in the corresponding
fields {{ content }}, {{ prediction }}, and {{ probability }}. You may notice that we used
the {{ content }} and {{ prediction }} placeholder variables a second time in the form that
contains the Correct and Incorrect buttons. This is a workaround to post those values back to the
server to update the classifier and store the review in case the user clicks on one of those two buttons.
Furthermore, we imported a CSS file (style.css) at the beginning of the results.html file. The
setup of this file is quite simple; it limits the width of the contents of this web app to 600 pixels and
moves the Incorrect and Correct buttons labeled with the div id button down by 20 pixels:


body {
  width: 600px;
}
#button {
  padding-top: 20px;
}


This CSS file is merely a placeholder, so please feel free to adjust it to change the look and feel of the
web app to your liking.


The last HTML file we will implement for our web application is the thanks.html template. As the
name suggests, it simply provides a nice thank you message to the user after providing feedback via
the Correct or Incorrect button. Furthermore, we put a Submit another review button at the bottom
of this page, which will redirect the user to the starting page. The contents of the thanks.html file
are as follows:


<!doctype html>
<html>
  <head>
    <title>Movie Classification</title>
  </head>
  <body>

<h3>Thank you for your feedback!</h3>
<div id='button'>
  <form action="/">
    <input type=submit value='Submit another review'>
  </form>
</div>

  </body>
</html>




Now, it would be a good idea to start the web app locally from our terminal via the following 
command before we advance to the next subsection and deploy it on a public web server: 


python3 app.py 


After we have finished testing our app, we also shouldn't forget to remove the debug=True argument
in the app.run() command of our app.py script.


Deploying the web application to a public
server


After we have tested the web application locally, we are now ready to deploy our web application 
onto a public web server. For this tutorial, we will be using the PythonAnywhere web hosting 
service, which specializes in the hosting of Python web applications and makes it extremely simple 
and hassle-free. Furthermore, PythonAnywhere offers a beginner account option that lets us run a 
single web application free of charge. 


To create a new PythonAnywhere account, we visit the website at https://www.pythonanywhere.com 
and click on the Pricing & signup link that is located in the top-right corner. Next, we click on the 
Create a Beginner account button where we need to provide a username, password, and a valid
e-mail address. After we have read and agreed to the terms and conditions, we should have a new
account. 


Unfortunately, the free beginner account doesn't allow us to access the remote server via the SSH 
protocol from our command-line terminal. Thus, we need to use the PythonAnywhere web interface to 
manage our web application. But before we can upload our local application files to the server, we need
to create a new web application for our PythonAnywhere account. After clicking on the
Dashboard button in the top-right corner, we have access to the control panel shown at the top of the 
page. Next, we click on the Web tab that is now visible at the top of the page. We proceed by clicking 
on the Add a new web app button on the left, which lets us create a new Python 3.4 Flask web 
application that we name movieclassifier. 


After creating a new application for our PythonAnywhere account, we head over to the Files tab to 
upload the files from our local movieclassifier directory using the PythonAnywhere web interface. 
After uploading the web application files that we created locally on our computer, we should have a 
movieclassifier directory in our PythonAnywhere account. It contains the same directories and 
files as our local movieclassifier directory has, as shown in the following screenshot:


[Screenshot: the PythonAnywhere Files tab for /home/raschkas/movieclassifier, listing the
__pycache__/, pkl_objects/, static/, and templates/ directories and the app.py and
reviews.sqlite files, with a file upload control below]





Lastly, we head over to the Web tab one more time and click on the Reload
<username>.pythonanywhere.com button to propagate the changes and refresh our web application.
Our web app should now be up and running and publicly available via the address
<username>.pythonanywhere.com.


Note 


Unfortunately, web servers can be quite sensitive to the tiniest problems in our web app. If you are 
experiencing problems with running the web application on PythonAnywhere and are receiving error 
messages in your browser, you can check the server and error logs which can be accessed from the 
Web tab in your PythonAnywhere account to better diagnose the problem. 
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Updating the movie review classifier 


While our predictive model is updated on-the-fly whenever a user provides feedback about the
classification, the updates to the clf object will be reset if the web server crashes or restarts. If we
reload the web application, the clf object will be reinitialized from the classifier.pkl pickle file.
One option to apply the updates permanently would be to pickle the clf object once again after each
update. However, this would become computationally very inefficient with a growing number of
users and could corrupt the pickle file if users provide feedback simultaneously. An alternative
solution is to update the predictive model from the feedback data that is being collected in the SQLite
database. One option would be to download the SQLite database from the PythonAnywhere server,
update the clf object locally on our computer, and upload the new pickle file to PythonAnywhere. To
update the classifier locally on our computer, we create an update.py script file in the
movieclassifier directory with the following contents:


import pickle
import sqlite3
import numpy as np
import os

# import HashingVectorizer from local dir
from vectorizer import vect

def update_model(db_path, model, batch_size=10000):

    conn = sqlite3.connect(db_path)
    c = conn.cursor()
    c.execute('SELECT * from review_db')

    results = c.fetchmany(batch_size)
    while results:
        data = np.array(results)
        X = data[:, 0]
        y = data[:, 1].astype(int)

        classes = np.array([0, 1])
        X_train = vect.transform(X)
        # update the model that was passed in, not a global object
        model.partial_fit(X_train, y, classes=classes)
        results = c.fetchmany(batch_size)

    conn.close()
    return None

cur_dir = os.path.dirname(__file__)

clf = pickle.load(open(os.path.join(cur_dir,
                  'pkl_objects',
                  'classifier.pkl'), 'rb'))
db = os.path.join(cur_dir, 'reviews.sqlite')

update_model(db_path=db, model=clf, batch_size=10000)

# Uncomment the following lines if you are sure that
# you want to update your classifier.pkl file
# permanently.

# pickle.dump(clf, open(os.path.join(cur_dir,
#             'pkl_objects', 'classifier.pkl'), 'wb')
#             , protocol=4)
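Overwriting classifier.pkl in place leaves a short window in which the file is incomplete. One common way to reduce that risk, sketched here as an optional variation and not as part of the update.py script, is to write the pickle to a temporary file first and then swap it into place. The helper name dump_atomically and the stand-in dictionary are hypothetical.

```python
import os
import pickle
import tempfile

def dump_atomically(obj, dest_path, protocol=4):
    """Pickle obj to a temporary file, then swap it into place."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Create the temporary file next to the target so the final
    # rename stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix='.tmp')
    try:
        with os.fdopen(fd, 'wb') as f:
            pickle.dump(obj, f, protocol=protocol)
        os.replace(tmp_path, dest_path)
    except BaseException:
        os.remove(tmp_path)
        raise

# Demonstration with a stand-in object instead of the real classifier.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'classifier.pkl')
dump_atomically({'weights': [0.1, 0.2]}, demo_path)
with open(demo_path, 'rb') as f:
    restored = pickle.load(f)
print(restored)
```

A reader of the pickle file then sees either the old version or the complete new one, which is the property that matters when the web application may be reloaded at any time.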


The update_model function will fetch entries from the SQLite database in batches of 10,000 entries
at a time unless the database contains fewer entries. Alternatively, we could also fetch one entry at a
time by using fetchone instead of fetchmany, which would be computationally very inefficient.
Using the alternative fetchall method could be a problem if we are working with large datasets that
exceed the computer or server's memory capacity.
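The batching behavior of fetchmany is easy to observe on a small scale. The following sketch fills an in-memory database with 25 made-up rows (the review_db column layout is assumed) and reads them back in batches of 10, mirroring the loop structure of update_model without the classifier.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE review_db '
             '(review TEXT, sentiment INTEGER, date TEXT)')
conn.executemany(
    "INSERT INTO review_db VALUES (?, ?, DATETIME('now'))",
    [('review number %d' % i, i % 2) for i in range(25)])

c = conn.cursor()
c.execute('SELECT * FROM review_db')

batch_sizes = []
results = c.fetchmany(10)
while results:  # fetchmany returns an empty list once all rows are read
    batch_sizes.append(len(results))
    results = c.fetchmany(10)
conn.close()

print(batch_sizes)  # 25 rows in batches of 10 -> [10, 10, 5]
```

Only one batch is held in memory at a time, which is why this pattern scales to tables that would not fit into memory with fetchall.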


Now that we have created the update.py script, we could also upload it to the movieclassifier
directory on PythonAnywhere and import the update_model function in the main application script
app.py to update the classifier from the SQLite database every time we restart the web application.
In order to do so, we just need to add a line of code to import the update_model function from the
update.py script at the top of app.py:


# import update function from local dir 
from update import update_model


We then need to call the update_model function in the main application body: 


if __name__ == '__main__':
    # the keyword must match the db_path parameter of update_model
    update_model(db_path=db, model=clf, batch_size=10000)


Summary


In this chapter, you learned about many useful and practical topics that extend our knowledge of 
machine learning theory. You learned how to serialize a model after training and how to load it for 
later use cases. Furthermore, we created a SQLite database for efficient data storage and created a 
web application that lets us make our movie classifier available to the outside world. 


Throughout this book, we have discussed many machine learning concepts, best practices,
and supervised models for classification. In the next chapter, we will take a look at another
subcategory of supervised learning, regression analysis, which lets us predict outcome variables on a
continuous scale, in contrast to the categorical class labels of the classification models that we have
been working with so far.


Chapter 10. Predicting Continuous Target
Variables with Regression Analysis


Throughout the previous chapters, you learned a lot about the main concepts behind supervised 
learning and trained many different models for classification tasks to predict group memberships or 
categorical variables. In this chapter, we will take a dive into another subcategory of supervised 
learning: regression analysis. 


Regression models are used to predict target variables on a continuous scale, which makes them 
attractive for addressing many questions in science as well as applications in industry, such as 
understanding relationships between variables, evaluating trends, or making forecasts. One example 
would be predicting the sales of a company in future months. 


In this chapter, we will discuss the main concepts of regression models and cover the following 
topics: 


• Exploring and visualizing datasets
• Looking at different approaches to implement linear regression models
• Training regression models that are robust to outliers
• Evaluating regression models and diagnosing common problems
• Fitting regression models to nonlinear data


Introducing a simple linear regression model


The goal of simple (univariate) linear regression is to model the relationship between a single
feature (explanatory variable x) and a continuous valued response (target variable y). The equation of
a linear model with one explanatory variable is defined as follows:


$$y = w_0 + w_1 x$$


Here, the weight $w_0$ represents the y axis intercept and $w_1$ is the coefficient of the explanatory
variable. Our goal is to learn the weights of the linear equation to describe the relationship between
the explanatory variable and the target variable, which can then be used to predict the responses of
new explanatory variables that were not part of the training dataset.
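As a small illustration of how a learned model is applied to new inputs, the sketch below evaluates the linear equation for a few values of x. The weights are hypothetical and were chosen only to keep the arithmetic easy to follow.

```python
def predict(x, w0, w1):
    """Evaluate the linear model y = w0 + w1 * x for one input value."""
    return w0 + w1 * x

# Hypothetical weights: intercept 2.0 and slope 0.5.
w0, w1 = 2.0, 0.5
predictions = [predict(x, w0, w1) for x in (0.0, 1.0, 4.0)]
print(predictions)  # [2.0, 2.5, 4.0]
```

At x = 0 the prediction equals the intercept, and each unit increase in x changes the prediction by the slope.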


Based on the linear equation that we defined previously, linear regression can be understood as 
finding the best-fitting straight line through the sample points, as shown in the following figure:


This best-fitting line is also called the regression line, and the vertical lines from the regression line
to the sample points are the so-called offsets or residuals, that is, the errors of our prediction.


The special case of one explanatory variable is also called simple linear regression, but of course
we can also generalize the linear regression model to multiple explanatory variables. Hence, this
process is called multiple linear regression:


$$y = w_0 x_0 + w_1 x_1 + \dots + w_m x_m = \sum_{i=0}^{m} w_i x_i = w^T x$$


Here, $w_0$ is the y axis intercept with $x_0 = 1$.


Exploring the Housing Dataset


Before we implement our first linear regression model, we will introduce a new dataset, the Housing 
Dataset, which contains information about houses in the suburbs of Boston collected by D. Harrison 
and D.L. Rubinfeld in 1978. The Housing Dataset has been made freely available and can be
downloaded from the UCI machine learning repository at
https://archive.ics.uci.edu/ml/datasets/Housing.


The features of the 506 samples may be summarized as shown in the excerpt of the dataset
description:


• CRIM: This is the per capita crime rate by town
• ZN: This is the proportion of residential land zoned for lots larger than 25,000 sq.ft.
• INDUS: This is the proportion of non-retail business acres per town
• CHAS: This is the Charles River dummy variable (this is equal to 1 if tract bounds river; 0
  otherwise)
• NOX: This is the nitric oxides concentration (parts per 10 million)
• RM: This is the average number of rooms per dwelling
• AGE: This is the proportion of owner-occupied units built prior to 1940
• DIS: This is the weighted distances to five Boston employment centers
• RAD: This is the index of accessibility to radial highways
• TAX: This is the full-value property-tax rate per $10,000
• PTRATIO: This is the pupil-teacher ratio by town
• B: This is calculated as 1000(Bk - 0.63)^2, where Bk is the proportion of people of African
  American descent by town
• LSTAT: This is the percentage lower status of the population
• MEDV: This is the median value of owner-occupied homes in $1000s


For the rest of this chapter, we will regard the housing prices (MEDV) as our target variable—the 
variable that we want to predict using one or more of the 13 explanatory variables. Before we 
explore this dataset further, let's fetch it from the UCI repository into a pandas DataFrame:


>>> import pandas as pd
>>> df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data',
...                  header=None, sep=r'\s+')
>>> df.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS',
...               'NOX', 'RM', 'AGE', 'DIS', 'RAD',
...               'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
>>> df.head()


To confirm that the dataset was loaded successfully, we displayed the first five lines of the dataset, as 
shown in the following screenshot: 
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Visualizing the important characteristics of a dataset 


Exploratory Data Analysis (EDA) is an important and recommended first step prior to the training 
of a machine learning model. In the rest of this section, we will use some simple yet useful techniques 
from the graphical EDA toolbox that may help us to visually detect the presence of outliers, the 
distribution of the data, and the relationships between features. 


First, we will create a scatterplot matrix that allows us to visualize the pair-wise correlations 
between the different features in this dataset in one place. To plot the scatterplot matrix, we will use 
the pairplot function from the seaborn library (http://stanford.edu/~mwaskom/software/seaborn/), 
which is a Python library for drawing statistical plots based on matplotlib: 


>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> sns.set(style='whitegrid', context='notebook')
>>> cols = ['LSTAT', 'INDUS', 'NOX', 'RM', 'MEDV']
>>> # note: seaborn 0.9 and later call the size parameter height
>>> sns.pairplot(df[cols], size=2.5);
>>> plt.show()


As we can see in the following figure, the scatterplot matrix provides us with a useful graphical 
summary of the relationships in a dataset: 
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Note 


Importing the seaborn library modifies the default aesthetics of matplotlib for the current Python 
session. If you do not want to use seaborn's style settings, you can reset the matplotlib settings by 
executing the following command:


>>> sns.reset_orig()


Due to space constraints and for purposes of readability, we only plotted five columns from the
dataset: LSTAT, INDUS, NOX, RM, and MEDV. However, you are encouraged to create a
scatterplot matrix of the whole DataFrame to further explore the data.


Using this scatterplot matrix, we can now quickly eyeball how the data is distributed and whether it 
contains outliers. For example, we can see that there is a linear relationship between RM and the 
housing prices MEDV (the fifth column of the fourth row). Furthermore, we can see in the histogram 
(the lower right subplot in the scatter plot matrix) that the MEDV variable seems to be normally 
distributed but contains several outliers. 


Note 


Note that in contrast to common belief, training a linear regression model does not require that the 
explanatory or target variables are normally distributed. The normality assumption is only a
requirement for certain statistical tests and hypothesis tests that are beyond the scope of this book
(Montgomery, D. C., Peck, E. A., and Vining, G. G. Introduction to linear regression analysis. John
Wiley and Sons, 2012, pp. 318-319).


To quantify the linear relationship between the features, we will now create a correlation matrix. A 
correlation matrix is closely related to the covariance matrix that we have seen in the section about
principal component analysis (PCA) in Chapter 4, Building Good Training Sets – Data
Preprocessing. Intuitively, we can interpret the correlation matrix as a rescaled version of the 
covariance matrix. In fact, the correlation matrix is identical to a covariance matrix computed from 
standardized data. 
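This equivalence can be checked numerically. The sketch below, which uses made-up sample values and plain Python rather than NumPy, computes Pearson's r directly and compares it with the population covariance of the standardized features; the helper names are illustrative.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(x, y):
    """Population covariance of two equally long sequences."""
    mu_x, mu_y = mean(x), mean(y)
    return sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / len(x)

def standardize(values):
    """Center at mean 0 and scale to (population) standard deviation 1."""
    mu = mean(values)
    sigma = math.sqrt(covariance(values, values))
    return [(v - mu) / sigma for v in values]

# Made-up sample values, not taken from the Housing Dataset.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

pearson_r = covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))
cov_of_standardized = covariance(standardize(x), standardize(y))

print(round(pearson_r, 6), round(cov_of_standardized, 6))
```

Both printed values agree up to floating-point rounding, which is the rescaling relationship described above applied to a single pair of features.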


The correlation matrix is a square matrix that contains the Pearson product-moment correlation
coefficients (often abbreviated as Pearson's r), which measure the linear dependence between pairs
of features. The correlation coefficients are bounded to the range from -1 to 1. Two features have a
perfect positive correlation if $r = 1$, no correlation if $r = 0$, and a perfect negative correlation if
$r = -1$, respectively. As mentioned previously, Pearson's correlation coefficient can simply be
calculated as the covariance between two features $x$ and $y$ (numerator) divided by the product of
their standard deviations (denominator):


$$r = \frac{\sum_{i=1}^{n} \left[ \left( x^{(i)} - \mu_x \right) \left( y^{(i)} - \mu_y \right) \right]}{\sqrt{\sum_{i=1}^{n} \left( x^{(i)} - \mu_x \right)^2} \sqrt{\sum_{i=1}^{n} \left( y^{(i)} - \mu_y \right)^2}} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$


Here, $\mu$ denotes the sample mean of the corresponding feature, $\sigma_{xy}$ is the covariance between the
features $x$ and $y$, and $\sigma_x$ and $\sigma_y$ are the features' standard deviations, respectively.


Note 


We can show that the covariance between standardized features is in fact equal to their linear 
correlation coefficient. 


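Let's first standardize the features $x$ and $y$ to obtain their z-scores, which we will denote as $x'$ and
$y'$, respectively:


$$x' = \frac{x - \mu_x}{\sigma_x}, \quad y' = \frac{y - \mu_y}{\sigma_y}$$


Remember that we calculate the (population) covariance between two features as follows:


$$\sigma_{xy} = \frac{1}{n} \sum_{i=1}^{n} \left( x^{(i)} - \mu_x \right) \left( y^{(i)} - \mu_y \right)$$


Since standardization centers a feature variable at mean 0, we can now calculate the covariance
between the scaled features as follows:


$$\sigma'_{xy} = \frac{1}{n} \sum_{i=1}^{n} \left( {x'}^{(i)} - 0 \right) \left( {y'}^{(i)} - 0 \right)$$


Through resubstitution, we get the following result:


$$\sigma'_{xy} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x^{(i)} - \mu_x}{\sigma_x} \right) \left( \frac{y^{(i)} - \mu_y}{\sigma_y} \right) = \frac{1}{n \cdot \sigma_x \sigma_y} \sum_{i=1}^{n} \left( x^{(i)} - \mu_x \right) \left( y^{(i)} - \mu_y \right)$$


We can simplify it as follows:


$$\sigma'_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$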


In the following code example, we will use NumPy's corrcoef function on the five feature columns 
that we previously visualized in the scatterplot matrix, and we will use seaborn's heatmap function to 
plot the correlation matrix array as a heat map:


>>> import numpy as np
>>> cm = np.corrcoef(df[cols].values.T)
>>> sns.set(font_scale=1.5)
>>> hm = sns.heatmap(cm,
...                  cbar=True,
...                  annot=True,
...                  square=True,
...                  fmt='.2f',
...                  annot_kws={'size': 15},
...                  yticklabels=cols,
...                  xticklabels=cols)
>>> plt.show()


As we can see in the resulting figure, the correlation matrix provides us with another useful summary 
graphic that can help us to select features based on their respective linear correlations: 
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To fit a linear regression model, we are interested in those features that have a high correlation with 
our target variable MEDV. Looking at the preceding correlation matrix, we see that our target 
variable MEDV shows the largest correlation with the LSTAT variable (-0.74). However, as you 
might remember from the scatterplot matrix, there is a clear nonlinear relationship between LSTAT
and MEDV. On the other hand, the correlation between RM and MEDV is also relatively high (0.70)
and given the linear relationship between those two variables that we observed in the scatterplot, RM
seems to be a good choice for an explanatory variable to introduce the concepts of a simple linear
regression model in the following section.


Implementing an ordinary least squares linear
regression model


At the beginning of this chapter, we discussed that linear regression can be understood as finding the 
best-fitting straight line through the sample points of our training data. However, we have neither 
defined the term best-fitting nor have we discussed the different techniques of fitting such a model. In 
the following subsections, we will fill in the missing pieces of this puzzle using the Ordinary Least 
Squares (OLS) method to estimate the parameters of the regression line that minimizes the sum of the 
squared vertical distances (residuals or errors) to the sample points. 
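For a single explanatory variable, the weights that minimize the sum of squared residuals can be written in closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the two means. The sketch below applies this to made-up points; it is a plain-Python illustration with hypothetical function names, not the implementation developed in the following subsections.

```python
def ols_fit(x, y):
    """Closed-form OLS estimates (w0, w1) for one explanatory variable."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y))
    s_xx = sum((a - mu_x) ** 2 for a in x)
    w1 = s_xy / s_xx
    w0 = mu_y - w1 * mu_x
    return w0, w1

def sse(x, y, w0, w1):
    """Sum of squared vertical distances between the points and the line."""
    return sum((b - (w0 + w1 * a)) ** 2 for a, b in zip(x, y))

# Made-up points scattered around the line y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]

w0, w1 = ols_fit(x, y)
best = sse(x, y, w0, w1)
# Nudging either weight away from the OLS solution increases the SSE.
worse = min(sse(x, y, w0 + 0.1, w1), sse(x, y, w0, w1 + 0.1))
print(round(w0, 3), round(w1, 3), best < worse)
```

The comparison at the end is only a spot check, but it shows what "best-fitting" means here: no nearby line has a smaller sum of squared residuals.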
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Solving regression for regression parameters with 
gradient descent 


Consider our implementation of the ADAptive LInear NEuron (Adaline) from Chapter 2, Training
Machine Learning Algorithms for Classification; we remember that the artificial neuron uses a
linear activation function and we defined a cost function $J(\cdot)$, which we minimized to learn the
weights via optimization algorithms, such as Gradient Descent (GD) and Stochastic Gradient
Descent (SGD). This cost function in Adaline is the Sum of Squared Errors (SSE). This is identical
to the OLS cost function that we defined:


$$J(w) = \frac{1}{2} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$$


Here, $\hat{y}$ is the predicted value $\hat{y} = w^T x$ (note that the term 1/2 is just used for convenience to derive
the update rule of GD). Essentially, OLS linear regression can be understood as Adaline without the
unit step function so that we obtain continuous target values instead of the class labels -1 and 1. To
demonstrate the similarity, let's take the GD implementation of Adaline from Chapter 2, Training
Machine Learning Algorithms for Classification, and remove the unit step function to implement our
first linear regression model:


class LinearRegressionGD(object):

    def __init__(self, eta=0.001, n_iter=20):
        self.eta = eta
        self.n_iter = n_iter

    def fit(self, X, y):
        self.w_ = np.zeros(1 + X.shape[1])
        self.cost_ = []

        for i in range(self.n_iter):
            output = self.net_input(X)
            errors = (y - output)
            self.w_[1:] += self.eta * X.T.dot(errors)
            self.w_[0] += self.eta * errors.sum()
            cost = (errors**2).sum() / 2.0
            self.cost_.append(cost)
        return self

    def net_input(self, X):
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def predict(self, X):
        return self.net_input(X)


If you need a refresher about how the weights are being updated—taking a step in the opposite 
direction of the gradient—please revisit the Adaline section in Chapter 2, Training Machine 
Learning Algorithms for Classification. 
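As a brief reminder of what a single update looks like, the following sketch performs batch gradient descent steps for one explanatory variable in plain Python. The data are made up and already centered, and the helper name gd_step is hypothetical; the update itself follows the SSE cost function, whose negative gradient is the sum of the errors for the intercept and the sum of the errors times x for the slope.

```python
def gd_step(x, y, w0, w1, eta):
    """Perform one batch gradient descent update for y_hat = w0 + w1 * x."""
    errors = [b - (w0 + w1 * a) for a, b in zip(x, y)]
    # Step in the direction of the negative gradient of
    # J(w) = 1/2 * sum(errors**2).
    new_w0 = w0 + eta * sum(errors)
    new_w1 = w1 + eta * sum(e * a for e, a in zip(errors, x))
    cost = sum(e ** 2 for e in errors) / 2.0
    return new_w0, new_w1, cost

# Made-up, already centered data that follows y = 2x exactly.
x = [-1.0, 0.0, 1.0]
y = [-2.0, 0.0, 2.0]

w0, w1 = 0.0, 0.0
costs = []
for _ in range(50):
    w0, w1, cost = gd_step(x, y, w0, w1, eta=0.1)
    costs.append(cost)

print(round(w0, 4), round(w1, 4))
```

The recorded costs shrink with every step, and the slope approaches 2, the value that generated the data.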


To see our LinearRegressionGD regressor in action, let's use the RM (number of rooms) variable 
from the Housing Data Set as the explanatory variable to train a model that can predict MEDV (the 
housing prices). Furthermore, we will standardize the variables for better convergence of the GD 
algorithm. The code is as follows: 


>>> X = df[['RM']].values
>>> y = df['MEDV'].values
>>> from sklearn.preprocessing import StandardScaler
>>> sc_x = StandardScaler()
>>> sc_y = StandardScaler()
>>> X_std = sc_x.fit_transform(X)
>>> # recent scikit-learn versions expect a 2D array, so y is reshaped and flattened again
>>> y_std = sc_y.fit_transform(y[:, np.newaxis]).flatten()
>>> lr = LinearRegressionGD()
>>> lr.fit(X_std, y_std)


We discussed in Chapter 2, Training Machine Learning Algorithms for Classification, that it is
always a good idea to plot the cost as a function of the number of epochs (passes over the training
dataset) when we are using optimization algorithms, such as gradient descent, to check for
convergence. To cut a long story short, let's plot the cost against the number of epochs to check if the
linear regression has converged:


>>> plt.plot(range(1, lr.n_iter+1), lr.cost_)
>>> plt.ylabel('SSE')
>>> plt.xlabel('Epoch')
>>> plt.show()


As we can see in the following plot, the GD algorithm converged after the fifth epoch:

[Figure: SSE plotted against the number of epochs]

Next, let's visualize how well the linear regression line fits the training data. To do so, we will define
a simple helper function that will plot a scatterplot of the training samples and add the regression
line:


>>> def lin_regplot(X, y, model):
...     plt.scatter(X, y, c='blue')
...     plt.plot(X, model.predict(X), color='red')
...     return None


Now, we will use this lin_regplot function to plot the number of rooms against house prices:


>>> lin_regplot(X_std, y_std, lr)
>>> plt.xlabel('Average number of rooms [RM] (standardized)')
>>> plt.ylabel('Price in $1000\'s [MEDV] (standardized)')
>>> plt.show()


As we can see in the following plot, the linear regression line reflects the general trend that house
prices tend to increase with the number of rooms:

[Figure: scatterplot of the training samples with the fitted regression line; x axis: Average number of
rooms [RM] (standardized), y axis: Price in $1000's [MEDV] (standardized)]


Although this observation makes intuitive sense, the data also tells us that the number of rooms does
not explain the house prices very well in many cases. Later in this chapter, we will discuss how to
quantify the performance of a regression model. Interestingly, we also observe a curious line at $y = 3$,
which suggests that the prices may have been clipped. In certain applications, it may also be
important to report the predicted outcome variables on their original scale. To scale the predicted price
outcome back on the Price in $1000's axes, we can simply apply the inverse_transform method of
the StandardScaler:


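>>> # recent scikit-learn versions expect 2D arrays for transform and inverse_transform
>>> num_rooms_std = sc_x.transform(np.array([[5.0]]))
>>> price_std = lr.predict(num_rooms_std)
>>> print("Price in $1000's: %.3f" %
...       sc_y.inverse_transform(price_std[:, np.newaxis])[0, 0])
Price in $1000's: 10.840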


In the preceding code example, we used the previously trained linear regression model to predict the
price of a house with five rooms. According to our model, such a house is worth $10,840.


On a side note, it is also worth mentioning that we technically don't have to update the weights of the
intercept if we are working with standardized variables since the y axis intercept is always 0 in those
cases. We can quickly confirm this by printing the weights:


>>> print('Slope: %.3f' % lr.w_[1])
Slope: 0.695
>>> print('Intercept: %.3f' % lr.w_[0])
Intercept: -0.000
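
The zero intercept is a general property of least-squares fits on standardized variables, and in the single-feature case the slope then equals the Pearson correlation coefficient. The following sketch illustrates both facts; it uses synthetic data as a stand-in for RM and MEDV (the generating coefficients are arbitrary assumptions) and NumPy's least-squares solver rather than our GD class:

```python
import numpy as np

# Synthetic data: a hypothetical stand-in for the RM and MEDV columns
rng = np.random.RandomState(0)
x = rng.normal(loc=6.0, scale=0.7, size=200)
y = 9.0 * x - 34.0 + rng.normal(scale=5.0, size=200)

# Standardize both variables to zero mean and unit variance
x_std = (x - x.mean()) / x.std()
y_std = (y - y.mean()) / y.std()

# Least-squares fit of y_std = w0 + w1 * x_std
A = np.column_stack([np.ones_like(x_std), x_std])
w0, w1 = np.linalg.lstsq(A, y_std, rcond=None)[0]

# Pearson correlation of the original (unstandardized) variables
r = np.corrcoef(x, y)[0, 1]

print('Intercept: %.3f' % w0)
print('Slope: %.3f, Pearson r: %.3f' % (w1, r))
```

Up to floating-point error, the intercept is zero and the slope matches the correlation coefficient.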
Estimating the coefficient of a regression model via scikit-learn


In the previous section, we implemented a working model for regression analysis. However, in a
real-world application, we may be interested in more efficient implementations, for example,
scikit-learn's LinearRegression object, which relies on an optimized least-squares solver (according
to the scikit-learn documentation, it wraps SciPy's scipy.linalg.lstsq) and works well with
unstandardized variables. This is sometimes desirable for certain applications:


>>> from sklearn.linear_model import LinearRegression
>>> slr = LinearRegression()
>>> slr.fit(X, y)
>>> print('Slope: %.3f' % slr.coef_[0])
Slope: 9.102
>>> print('Intercept: %.3f' % slr.intercept_)
Intercept: -34.671


As we can see by executing the preceding code, scikit-learn's LinearRegression model fitted with 
the unstandardized RM and MEDV variables yielded different model coefficients. Let's compare it to 
our own GD implementation by plotting MEDV against RM: 


>>> lin_regplot(X, y, slr)
>>> # the variables are unstandardized here, so the axis labels say so
>>> plt.xlabel('Average number of rooms [RM]')
>>> plt.ylabel('Price in $1000\'s [MEDV]')
>>> plt.show()


Now, when we plot the training data and our fitted model by executing the code above, we can see
that the overall result looks identical to our GD implementation:

[Figure: scatterplot of the training samples with the fitted regression line; x axis: Average number of
rooms [RM], y axis: Price in $1000's [MEDV]]


Note 


As an alternative to using machine learning libraries, there is a closed-form solution for solving OLS
involving a system of linear equations that can be found in most introductory statistics textbooks:


$$w_1 = (X^T X)^{-1} X^T y$$

$$w_0 = \mu_y - \mu_{\hat{y}}$$

Here, $\mu_y$ is the mean of the true target values and $\mu_{\hat{y}}$ is the mean of the predicted response.


The advantage of this method is that it is guaranteed to find the optimal solution analytically.
However, if we are working with very large datasets, it can be computationally too expensive to
invert the matrix in this formula (sometimes also called the normal equation) or the sample matrix
may be singular (non-invertible), which is why we may prefer iterative methods in certain cases.
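
As a minimal sketch of the closed-form approach, the following code solves the normal equation with NumPy on synthetic data. It makes two assumptions that differ from the notation above: a column of ones is prepended to the feature matrix so that the intercept is estimated together with the slope, and the linear system is solved directly instead of forming the matrix inverse explicitly:

```python
import numpy as np

# Synthetic one-feature data set (arbitrary, for illustration only)
rng = np.random.RandomState(1)
X = rng.uniform(3.0, 9.0, size=(100, 1))
y = 9.0 * X[:, 0] - 35.0 + rng.normal(scale=2.0, size=100)

# Prepend a column of ones so that w[0] plays the role of the intercept
Xb = np.hstack([np.ones((X.shape[0], 1)), X])

# Normal equation: solve (Xb^T Xb) w = Xb^T y without forming the inverse
w = np.linalg.solve(Xb.T.dot(Xb), Xb.T.dot(y))

# Reference solution from NumPy's least-squares solver
w_ref = np.linalg.lstsq(Xb, y, rcond=None)[0]

print('Intercept: %.3f, slope: %.3f' % (w[0], w[1]))
```

Both routes give the same weights here; np.linalg.solve raises an error if the matrix is singular, which is the failure case mentioned above.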


If you are interested in more information on how to obtain the normal equations, I recommend you
take a look at Dr. Stephen Pollock's chapter, The Classical Linear Regression Model, from his
lectures at the University of Leicester, which are available for free online.


Fitting a robust regression model using 
RANSAC 


Linear regression models can be heavily impacted by the presence of outliers. In certain situations, a 
very small subset of our data can have a big effect on the estimated model coefficients. There are 
many statistical tests that can be used to detect outliers, which are beyond the scope of the book. 
However, removing outliers always requires our own judgment as a data scientist, as well as our 
domain knowledge. 


As an alternative to throwing out outliers, we will look at a robust method of regression using the 
RANdom SAmple Consensus (RANSAC) algorithm, which fits a regression model to a subset of the 
data, the so-called inliers. 


We can summarize the iterative RANSAC algorithm as follows: 


1. Select a random number of samples to be inliers and fit the model.
2. Test all other data points against the fitted model and add those points that fall within a
   user-given tolerance to the inliers.
3. Refit the model using all inliers.
4. Estimate the error of the fitted model versus the inliers.
5. Terminate the algorithm if the performance meets a certain user-defined threshold or if a fixed
   number of iterations has been reached; go back to step 1 otherwise.
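
The steps above can be sketched in a few lines of NumPy for a single explanatory variable. This is a simplified illustration and not scikit-learn's implementation: the function name ransac_line and its parameters are hypothetical, the size of the consensus set stands in for the error estimate of step 4, and the loop always runs for a fixed number of trials:

```python
import numpy as np

def ransac_line(x, y, n_samples=10, threshold=5.0, max_trials=100, seed=0):
    """Simplified RANSAC for a straight-line model (illustrative sketch)."""
    rng = np.random.RandomState(seed)
    best_coef, best_mask = None, None
    for _ in range(max_trials):
        # Step 1: fit the model to a random subset of candidate inliers
        idx = rng.choice(len(x), size=n_samples, replace=False)
        coef = np.polyfit(x[idx], y[idx], deg=1)
        # Step 2: points within the tolerance of this fit count as inliers
        mask = np.abs(y - np.polyval(coef, x)) < threshold
        if mask.sum() < 2:
            continue
        # Step 3: refit the model using all inliers
        coef = np.polyfit(x[mask], y[mask], deg=1)
        # Steps 4 and 5 (simplified): keep the fit with the largest inlier set
        if best_mask is None or mask.sum() > best_mask.sum():
            best_coef, best_mask = coef, mask
    return best_coef, best_mask

# Synthetic data with a few gross outliers
rng = np.random.RandomState(42)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=100)
y[:10] += 40.0

coef, inliers = ransac_line(x, y, threshold=2.0)
print('Slope: %.3f, intercept: %.3f' % (coef[0], coef[1]))
print('Number of inliers: %d' % inliers.sum())
```

Because the ten shifted points lie far from any line that agrees with the bulk of the data, they end up outside the consensus set and do not influence the final fit.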


Let's now wrap our linear model in the RANSAC algorithm using scikit-learn's RANSACRegressor
object:


>>> from sklearn.linear_model import RANSACRegressor
>>> # residual_metric is the parameter name in the scikit-learn version used here;
>>> # later releases replaced it with the loss parameter
>>> ransac = RANSACRegressor(LinearRegression(),
...              max_trials=100,
...              min_samples=50,
...              residual_metric=lambda x: np.sum(np.abs(x), axis=1),
...              residual_threshold=5.0,
...              random_state=0)
>>> ransac.fit(X, y)


We set the maximum number of iterations of the RANSACRegressor to 100, and using
min_samples=50, we set the minimum number of the randomly chosen samples to be at least 50.
Using the residual_metric parameter, we provided a callable lambda function that simply
calculates the absolute vertical distances between the fitted line and the sample points. By setting the
residual_threshold parameter to 5.0, we only allowed samples to be included in the inlier set if
their vertical distance to the fitted line is within 5 distance units, which works well on this particular
dataset. By default, scikit-learn uses the MAD estimate to select the inlier threshold, where MAD
stands for the Median Absolute Deviation of the target values y. However, the choice of an
appropriate value for the inlier threshold is problem-specific, which is one disadvantage of
RANSAC. Many different approaches have been developed over the recent years to select a good
inlier threshold automatically. You can find a detailed discussion in R. Toldo and A. Fusiello's
Automatic Estimation of the Inlier Threshold in Robust Multiple Structures Fitting (in Image
Analysis and Processing - ICIAP 2009, pages 123-131. Springer, 2009).
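
To make the MAD concrete, here is a short NumPy sketch of the median absolute deviation. The target values are made up for illustration, and the snippet only shows the statistic itself, not how scikit-learn applies it internally:

```python
import numpy as np

# Hypothetical target values with one extreme observation
y_demo = np.array([10.0, 12.0, 11.0, 13.0, 50.0])

# Median Absolute Deviation: median of the absolute deviations from the median
mad = np.median(np.abs(y_demo - np.median(y_demo)))

print('Median: %.1f, MAD: %.1f' % (np.median(y_demo), mad))
```

Unlike the standard deviation, the MAD is barely affected by the single extreme value, which is why it is a natural scale estimate for an inlier threshold.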


After we have fitted the RANSAC model, let's obtain the inliers and outliers from the fitted RANSAC 
linear regression model and plot them together with the linear fit: 


>>> inlier_mask = ransac.inlier_mask_
>>> outlier_mask = np.logical_not(inlier_mask)
>>> line_X = np.arange(3, 10, 1)
>>> line_y_ransac = ransac.predict(line_X[:, np.newaxis])
>>> plt.scatter(X[inlier_mask], y[inlier_mask],
...             c='blue', marker='o', label='Inliers')
>>> plt.scatter(X[outlier_mask], y[outlier_mask],
...             c='lightgreen', marker='s', label='Outliers')
>>> plt.plot(line_X, line_y_ransac, color='red')
>>> plt.xlabel('Average number of rooms [RM]')
>>> plt.ylabel('Price in $1000\'s [MEDV]')
>>> plt.legend(loc='upper left')
>>> plt.show()


As we can see in the following scatterplot, the linear regression model was fitted on the detected set
of inliers shown as circles:

When we print the slope and intercept of the model executing the following code, we can see that the
linear regression line is slightly different from the fit that we obtained in the previous section without
RANSAC:


>>> print('Slope: %.3f' % ransac.estimator_.coef_[0])
Slope: 9.621
>>> print('Intercept: %.3f' % ransac.estimator_.intercept_)
Intercept: -37.137


Using RANSAC, we reduced the potential effect of the outliers in this dataset, but we don't know if 
this approach has a positive effect on the predictive performance for unseen data. Thus, in the next 
section we will discuss how to evaluate a regression model for different approaches, which is a 
crucial part of building systems for predictive modeling. 
Evaluating the performance of linear regression models


In the previous section, we discussed how to fit a regression model on training data. However, you
learned in previous chapters that it is crucial to test the model on data that it hasn't seen during
training to obtain an unbiased estimate of its performance.


As we remember from Chapter 6, Learning Best Practices for Model Evaluation and 
Hyperparameter Tuning, we want to split our dataset into separate training and test datasets where 
we use the former to fit the model and the latter to evaluate its performance to generalize to unseen 
data. Instead of proceeding with the simple regression model, we will now use all variables in the 
dataset and train a multiple regression model: 


>>> # in scikit-learn 0.18 and later, train_test_split lives in sklearn.model_selection
>>> from sklearn.model_selection import train_test_split
>>> X = df.iloc[:, :-1].values
>>> y = df['MEDV'].values
>>> X_train, X_test, y_train, y_test = train_test_split(
...       X, y, test_size=0.3, random_state=0)
>>> slr = LinearRegression()
>>> slr.fit(X_train, y_train)
>>> y_train_pred = slr.predict(X_train)
>>> y_test_pred = slr.predict(X_test)


Since our model uses multiple explanatory variables, we can't visualize the linear regression line (or 
hyperplane to be precise) in a two-dimensional plot, but we can plot the residuals (the differences or 
vertical distances between the actual and predicted values) versus the predicted values to diagnose 
our regression model. Those residual plots are a commonly used graphical analysis for diagnosing 
regression models to detect nonlinearity and outliers, and to check if the errors are randomly 
distributed. 


Using the following code, we will now plot a residual plot where we simply subtract the true target 
variables from our predicted responses: 


>>> plt.scatter(y_train_pred, y_train_pred - y_train,
...             c='blue', marker='o', label='Training data')
>>> plt.scatter(y_test_pred, y_test_pred - y_test,
...             c='lightgreen', marker='s', label='Test data')
>>> plt.xlabel('Predicted values')
>>> plt.ylabel('Residuals')
>>> plt.legend(loc='upper left')
>>> plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color='red')
>>> plt.xlim([-10, 50])
>>> plt.show()

After executing the code, we should see a residual plot with a line passing through the x axis origin as
shown here:

[Figure: residual plot for the training and test data; x axis: Predicted values, y axis: Residuals]


In the case of a perfect prediction, the residuals would be exactly zero, which we will probably never
encounter in realistic and practical applications. However, for a good regression model, we would
expect that the errors are randomly distributed and the residuals should be randomly scattered around
the centerline. If we see patterns in a residual plot, it means that our model is unable to capture some
explanatory information, which is leaked into the residuals as we can slightly see in our preceding
residual plot. Furthermore, we can also use residual plots to detect outliers, which are represented by
the points with a large deviation from the centerline.


Another useful quantitative measure of a model's performance is the so-called Mean Squared Error
(MSE), which is simply the average value of the SSE cost function that we minimize to fit the linear
regression model. The MSE is useful for comparing different regression models or for tuning their
parameters via a grid search and cross-validation:


$$MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$$


Execute the following code: 


>>> from sklearn.metrics import mean_squared_error
>>> print('MSE train: %.3f, test: %.3f' % (
...        mean_squared_error(y_train, y_train_pred),
...        mean_squared_error(y_test, y_test_pred)))


We will see that the MSE on the training set is 19.96, and the MSE of the test set is much larger with a
value of 27.20, which is an indicator that our model is overfitting the training data.


Sometimes it may be more useful to report the coefficient of determination ($R^2$), which can be
understood as a standardized version of the MSE, for better interpretability of the model
performance. In other words, $R^2$ is the fraction of response variance that is captured by the model.
The $R^2$ value is defined as follows:

$$R^2 = 1 - \frac{SSE}{SST}$$

Here, SSE is the sum of squared errors and SST is the total sum of squares

$$SST = \sum_{i=1}^{n} \left( y^{(i)} - \mu_y \right)^2$$

or in other words, it is simply the variance of the response (up to the factor 1/n).

Let's quickly show that $R^2$ is indeed just a rescaled version of the MSE:

$$R^2 = 1 - \frac{SSE}{SST}
      = 1 - \frac{\frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2}
                 {\frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \mu_y \right)^2}
      = 1 - \frac{MSE}{Var(y)}$$

For the training dataset, $R^2$ is bounded between 0 and 1, but it can become negative for the test set. If
$R^2 = 1$, the model fits the data perfectly with a corresponding $MSE = 0$.

Evaluated on the training data, the $R^2$ of our model is 0.765, which doesn't sound too bad. However,
the $R^2$ on the test dataset is only 0.673, which we can compute by executing the following code:


>>> from sklearn.metrics import r2_score
>>> print('R^2 train: %.3f, test: %.3f' %
...       (r2_score(y_train, y_train_pred),
...        r2_score(y_test, y_test_pred)))
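
To see how the MSE and the coefficient of determination relate numerically, the following sketch computes both by hand with NumPy. The four target values and predictions are made up for illustration:

```python
import numpy as np

# Hypothetical true targets and model predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

sse = np.sum((y_true - y_pred) ** 2)          # sum of squared errors
sst = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
mse = sse / len(y_true)                       # mean squared error

# Two equivalent ways to obtain the coefficient of determination
r2_from_sums = 1.0 - sse / sst
r2_from_mse = 1.0 - mse / np.var(y_true)

print('MSE: %.3f' % mse)
print('R^2: %.3f (from SSE/SST), %.3f (from MSE/Var)' % (r2_from_sums, r2_from_mse))
```

Both expressions agree because dividing SSE and SST by the same sample size leaves their ratio unchanged.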
Using regularized methods for regression


As we discussed in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn,
regularization is one approach to tackle the problem of overfitting by adding additional information,
and thereby shrinking the parameter values of the model to induce a penalty against complexity. The
most popular approaches to regularized linear regression are the so-called Ridge Regression, Least
Absolute Shrinkage and Selection Operator (LASSO), and Elastic Net methods.


Ridge regression is an L2 penalized model where we simply add the squared sum of the weights to
our least-squares cost function:

$$J(w)_{Ridge} = \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 + \lambda \lVert w \rVert_2^2$$

Here:

$$L2: \quad \lambda \lVert w \rVert_2^2 = \lambda \sum_{j=1}^{m} w_j^2$$

By increasing the value of the hyperparameter $\lambda$, we increase the regularization strength and shrink
the weights of our model. Please note that we don't regularize the intercept term $w_0$.
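
The shrinking effect can be illustrated with the closed-form ridge solution. The following NumPy sketch runs on synthetic data; the helper name ridge_weights is hypothetical, and centering the features and the target is one assumed way to leave the intercept out of the penalty:

```python
import numpy as np

# Synthetic data with three features (arbitrary, for illustration only)
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))
y = X.dot(np.array([3.0, -2.0, 0.5])) + 4.0 + rng.normal(scale=0.5, size=50)

# Center X and y so that the intercept drops out and stays unpenalized
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def ridge_weights(Xc, yc, lam):
    """Minimize sum((yc - Xc w)^2) + lam * ||w||_2^2 in closed form."""
    n_features = Xc.shape[1]
    return np.linalg.solve(Xc.T.dot(Xc) + lam * np.eye(n_features),
                           Xc.T.dot(yc))

norms = []
for lam in [0.0, 1.0, 10.0, 100.0]:
    w = ridge_weights(Xc, yc, lam)
    norms.append(np.linalg.norm(w))
    print('lambda=%6.1f  ||w||_2=%.3f' % (lam, norms[-1]))
```

The L2 norm of the weight vector decreases as the regularization strength grows, while a value of 0 recovers the ordinary least-squares weights.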


An alternative approach that can lead to sparse models is the LASSO. Depending on the
regularization strength, certain weights can become zero, which makes the LASSO also useful as a
supervised feature selection technique:

$$J(w)_{LASSO} = \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 + \lambda \lVert w \rVert_1$$

Here:

$$L1: \quad \lambda \lVert w \rVert_1 = \lambda \sum_{j=1}^{m} \left| w_j \right|$$


However, a limitation of the LASSO is that it selects at most $n$ variables if $m > n$. A compromise
between Ridge regression and the LASSO is the Elastic Net, which has an L1 penalty to generate
sparsity and an L2 penalty to overcome some of the limitations of the LASSO, such as the number of
selected variables.

$$J(w)_{ElasticNet} = \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2
                      + \lambda_1 \sum_{j=1}^{m} w_j^2 + \lambda_2 \sum_{j=1}^{m} \left| w_j \right|$$





Those regularized regression models are all available via scikit-learn, and the usage is similar to the
regular regression model except that we have to specify the regularization strength via the parameter
$\lambda$, for example, optimized via k-fold cross-validation.

A Ridge Regression model can be initialized as follows:

>>> from sklearn.linear_model import Ridge
>>> ridge = Ridge(alpha=1.0)

Note that the regularization strength is regulated by the parameter alpha, which is similar to the
parameter $\lambda$. Likewise, we can initialize a LASSO regressor from the linear_model submodule:


>>> from sklearn.linear_model import Lasso
>>> lasso = Lasso(alpha=1.0)


Lastly, the ElasticNet implementation allows us to vary the L1 to L2 ratio:


>>> from sklearn.linear_model import ElasticNet
>>> elanet = ElasticNet(alpha=1.0, l1_ratio=0.5)


For example, if we set l1_ratio to 1.0, the ElasticNet regressor would be equal to LASSO
regression. For more detailed information about the different implementations of linear regression,
please see the documentation at http://scikit-learn.org/stable/modules/linear_model.html.


Turning a linear regression model into a curve - polynomial regression


In the previous sections, we assumed a linear relationship between explanatory and response
variables. One way to account for the violation of the linearity assumption is to use a polynomial
regression model by adding polynomial terms:

$$y = w_0 + w_1 x + w_2 x^2 + \dots + w_d x^d$$

Here, $d$ denotes the degree of the polynomial. Although we can use polynomial regression to model a
nonlinear relationship, it is still considered a multiple linear regression model because of the linear
regression coefficients $w$.
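
Because the model is linear in its coefficients, a polynomial can be fitted with ordinary least squares once the powers of x are placed in the columns of a design matrix. The following NumPy sketch demonstrates this on synthetic data; the helper names design_matrix and fit_and_mse are hypothetical, and the generating coefficients are arbitrary:

```python
import numpy as np

# Synthetic data from a quadratic relationship plus noise
rng = np.random.RandomState(0)
x = np.linspace(-3.0, 3.0, 30)
y = 1.0 + 2.0 * x - 0.5 * x ** 2 + rng.normal(scale=0.3, size=x.size)

def design_matrix(x, d):
    """Return a matrix with the columns 1, x, x^2, ..., x^d."""
    return np.vander(x, N=d + 1, increasing=True)

def fit_and_mse(x, y, d):
    """Least-squares fit of a degree-d polynomial and its training MSE."""
    A = design_matrix(x, d)
    w = np.linalg.lstsq(A, y, rcond=None)[0]
    return w, np.mean((y - A.dot(w)) ** 2)

w_lin, mse_lin = fit_and_mse(x, y, d=1)
w_quad, mse_quad = fit_and_mse(x, y, d=2)

print('Training MSE linear: %.3f, quadratic: %.3f' % (mse_lin, mse_quad))
print('Quadratic weights (w0, w1, w2):', np.round(w_quad, 3))
```

The same least-squares solver handles both models; only the design matrix changes.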


We will now discuss how to use the PolynomialFeatures transformer class from scikit-learn to add
a quadratic term ($d = 2$) to a simple regression problem with one explanatory variable, and compare
the polynomial to the linear fit. The steps are as follows:


1. Add a second degree polynomial term:


   >>> from sklearn.preprocessing import PolynomialFeatures
   >>> X = np.array([258.0, 270.0, 294.0,
   ...               320.0, 342.0, 368.0,
   ...               396.0, 446.0, 480.0,
   ...               586.0])[:, np.newaxis]
   >>> y = np.array([236.4, 234.4, 252.8,
   ...               298.6, 314.2, 342.2,
   ...               360.8, 368.0, 391.2,
   ...               390.8])
   >>> lr = LinearRegression()
   >>> pr = LinearRegression()
   >>> quadratic = PolynomialFeatures(degree=2)
   >>> X_quad = quadratic.fit_transform(X)


2. Fit a simple linear regression model for comparison:


   >>> lr.fit(X, y)
   >>> X_fit = np.arange(250, 600, 10)[:, np.newaxis]
   >>> y_lin_fit = lr.predict(X_fit)


3. Fit a multiple regression model on the transformed features for polynomial regression:


   >>> pr.fit(X_quad, y)
   >>> y_quad_fit = pr.predict(quadratic.fit_transform(X_fit))


4. Plot the results:


   >>> plt.scatter(X, y, label='training points')
   >>> plt.plot(X_fit, y_lin_fit,
   ...          label='linear fit', linestyle='--')
   >>> plt.plot(X_fit, y_quad_fit,
   ...          label='quadratic fit')
   >>> plt.legend(loc='upper left')
   >>> plt.show()


In the resulting plot, we can see that the polynomial fit captures the relationship between the response
and explanatory variable much better than the linear fit:

[Figure: training points with the linear fit (dashed line) and the quadratic fit (solid line)]


>>> y_lin_pred = lr.predict(X)
>>> y_quad_pred = pr.predict(X_quad)
>>> print('Training MSE linear: %.3f, quadratic: %.3f' % (
...         mean_squared_error(y, y_lin_pred),
...         mean_squared_error(y, y_quad_pred)))
Training MSE linear: 569.780, quadratic: 61.330
>>> print('Training R^2 linear: %.3f, quadratic: %.3f' % (
...         r2_score(y, y_lin_pred),
...         r2_score(y, y_quad_pred)))
Training R^2 linear: 0.832, quadratic: 0.982


As we can see after executing the preceding code, the MSE decreased from 570 (linear fit) to 61
(quadratic fit), and the coefficient of determination reflects a closer fit to the quadratic model
($R^2 = 0.982$) as opposed to the linear fit ($R^2 = 0.832$) in this particular toy problem.


Modeling nonlinear relationships in the Housing Dataset


After we discussed how to construct polynomial features to fit nonlinear relationships in a toy 
problem, let's now take a look at a more concrete example and apply those concepts to the data in the 
Housing Dataset. By executing the following code, we will model the relationship between house 
prices and LSTAT (percent lower status of the population) using second degree (quadratic) and third 
degree (cubic) polynomials and compare it to a linear fit. 


The code is as follows:


>>> X = df[['LSTAT']].values
>>> y = df['MEDV'].values
>>> regr = LinearRegression()


# create polynomial features
>>> quadratic = PolynomialFeatures(degree=2)
>>> cubic = PolynomialFeatures(degree=3)
>>> X_quad = quadratic.fit_transform(X)
>>> X_cubic = cubic.fit_transform(X)


# linear fit
>>> X_fit = np.arange(X.min(), X.max(), 1)[:, np.newaxis]
>>> regr = regr.fit(X, y)
>>> y_lin_fit = regr.predict(X_fit)
>>> linear_r2 = r2_score(y, regr.predict(X))


# quadratic fit
>>> regr = regr.fit(X_quad, y)
>>> y_quad_fit = regr.predict(quadratic.fit_transform(X_fit))
>>> quadratic_r2 = r2_score(y, regr.predict(X_quad))


# cubic fit
>>> regr = regr.fit(X_cubic, y)
>>> y_cubic_fit = regr.predict(cubic.fit_transform(X_fit))
>>> cubic_r2 = r2_score(y, regr.predict(X_cubic))


# plot results
>>> plt.scatter(X, y,
...             label='training points',
...             color='lightgray')
>>> plt.plot(X_fit, y_lin_fit,
...          label='linear (d=1), $R^2=%.2f$'
...            % linear_r2,
...          color='blue',
...          lw=2,
...          linestyle=':')
>>> plt.plot(X_fit, y_quad_fit,
...          label='quadratic (d=2), $R^2=%.2f$'
...            % quadratic_r2,
...          color='red',
...          lw=2,
...          linestyle='-')
>>> plt.plot(X_fit, y_cubic_fit,
...          label='cubic (d=3), $R^2=%.2f$'
...            % cubic_r2,
...          color='green',
...          lw=2,
...          linestyle='--')
>>> plt.xlabel('% lower status of the population [LSTAT]')
>>> plt.ylabel('Price in $1000\'s [MEDV]')
>>> plt.legend(loc='upper right')
>>> plt.show()


As we can see in the resulting plot, the cubic fit captures the relationship between the house prices
and LSTAT better than the linear and quadratic fit. However, we should be aware that adding more
and more polynomial features increases the complexity of a model and therefore increases the chance
of overfitting. Thus, in practice, it is always recommended that you evaluate the performance of the
model on a separate test dataset to estimate the generalization performance:

[Figure: MEDV plotted against LSTAT with three fits: linear (d=1), $R^2 = 0.54$; quadratic (d=2),
$R^2 = 0.64$; cubic (d=3), $R^2 = 0.66$]

In addition, polynomial features are not always the best choice for modeling nonlinear relationships.
For example, just by looking at the MEDV-LSTAT scatterplot, we could propose that a log
transformation of the LSTAT feature variable and the square root of MEDV may project the data onto
a linear feature space suitable for a linear regression fit. Let's test this hypothesis by executing the
following code:


# transform features
>>> X_log = np.log(X)
>>> y_sqrt = np.sqrt(y)


# fit features
>>> X_fit = np.arange(X_log.min()-1,
...                   X_log.max()+1, 1)[:, np.newaxis]
>>> regr = regr.fit(X_log, y_sqrt)
>>> y_lin_fit = regr.predict(X_fit)
>>> linear_r2 = r2_score(y_sqrt, regr.predict(X_log))


# plot results
>>> plt.scatter(X_log, y_sqrt,
...             label='training points',
...             color='lightgray')
>>> plt.plot(X_fit, y_lin_fit,
...          label='linear (d=1), $R^2=%.2f$' % linear_r2,
...          color='blue',
...          lw=2)
>>> plt.xlabel('log(% lower status of the population [LSTAT])')
>>> plt.ylabel('$\sqrt{Price \; in \; \$1000\'s [MEDV]}$')
>>> plt.legend(loc='lower left')
>>> plt.show()


After transforming the explanatory variable onto the log space and taking the square root of the target
variable, we were able to capture the relationship between the two variables with a linear
regression line that seems to fit the data better ($R^2 = 0.69$) than any of the previous polynomial
feature transformations:

[Figure: square root of the price in $1000's [MEDV] plotted against log(LSTAT) with a linear fit (d=1),
$R^2 = 0.69$]


Dealing with nonlinear relationships using random forests


In this section, we are going to take a look at random forest regression, which is conceptually
different from the previous regression models in this chapter. A random forest, which is an ensemble
of multiple decision trees, can be understood as the sum of piecewise linear functions in contrast to
the global linear and polynomial regression models that we discussed previously. In other words, via
the decision tree algorithm, we are subdividing the input space into smaller regions that become more
manageable.


Decision tree regression 


An advantage of the decision tree algorithm is that it does not require any transformation of the
features if we are dealing with nonlinear data. We remember from Chapter 3, A Tour of Machine
Learning Classifiers Using Scikit-learn, that we grow a decision tree by iteratively splitting its
nodes until the leaves are pure or a stopping criterion is satisfied. When we used decision trees for
classification, we defined entropy as a measure of impurity to determine which feature split
maximizes the Information Gain (IG), which can be defined as follows for a binary split:

$$IG(D_p, x) = I(D_p) - \frac{N_{left}}{N_p} I(D_{left}) - \frac{N_{right}}{N_p} I(D_{right})$$

Here, $x$ is the feature to perform the split, $N_p$ is the number of samples in the parent node, $I$ is the
impurity function, $D_p$ is the subset of training samples in the parent node, and $D_{left}$ and $D_{right}$ are the
subsets of training samples in the left and right child node after the split. Remember that our goal is to
find the feature split that maximizes the information gain, or in other words, we want to find the
feature split that reduces the impurities in the child nodes. In Chapter 3, A Tour of Machine Learning
Classifiers Using Scikit-learn, we used entropy as a measure of impurity, which is a useful criterion
for classification. To use a decision tree for regression, we will replace entropy as the impurity
measure of a node $t$ by the MSE:

$$I(t) = MSE(t) = \frac{1}{N_t} \sum_{i \in D_t} \left( y^{(i)} - \hat{y}_t \right)^2$$

Here, $N_t$ is the number of training samples at node $t$, $D_t$ is the training subset at node $t$, $y^{(i)}$ is the
true target value, and $\hat{y}_t$ is the predicted target value (sample mean):

$$\hat{y}_t = \frac{1}{N_t} \sum_{i \in D_t} y^{(i)}$$
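
To illustrate how the MSE impurity drives a split, the following NumPy sketch searches all thresholds of a single feature for the one that maximizes the impurity reduction of a binary split. The function names node_mse and best_split are hypothetical, the data is made up, and a real decision tree implementation is considerably more efficient:

```python
import numpy as np

def node_mse(y):
    """Impurity I(t): mean squared deviation from the node mean."""
    return np.mean((y - y.mean()) ** 2)

def best_split(x, y):
    """Return the threshold on x with the largest impurity reduction."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    n = len(y)
    best_gain, best_threshold = 0.0, None
    for i in range(1, n):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold fits between identical feature values
        left, right = y_sorted[:i], y_sorted[i:]
        # Parent impurity minus the weighted impurities of both children
        gain = (node_mse(y_sorted)
                - len(left) / n * node_mse(left)
                - len(right) / n * node_mse(right))
        if gain > best_gain:
            best_gain = gain
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2.0
    return best_threshold, best_gain

# Two clearly separated groups of hypothetical samples
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 6.0, 5.5, 20.0, 21.0, 19.5])

threshold, gain = best_split(x, y)
print('Best threshold: %.2f, impurity reduction: %.3f' % (threshold, gain))
```

The search places the threshold in the gap between the two groups, where the within-node variance of both children is smallest.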


In the context of decision tree regression, the MSE is often also referred to as within-node variance,
which is why the splitting criterion is also better known as variance reduction. To see what the line
fit of a decision tree looks like, let's use the DecisionTreeRegressor implemented in scikit-learn to
model the nonlinear relationship between the MEDV and LSTAT variables:


>>> from sklearn.tree import DecisionTreeRegressor
>>> X = df[['LSTAT']].values
>>> y = df['MEDV'].values
>>> tree = DecisionTreeRegressor(max_depth=3)
>>> tree.fit(X, y)
>>> sort_idx = X.flatten().argsort()
>>> lin_regplot(X[sort_idx], y[sort_idx], tree)
>>> plt.xlabel('% lower status of the population [LSTAT]')
>>> plt.ylabel('Price in $1000\'s [MEDV]')
>>> plt.show()


As we can see from the resulting plot, the decision tree captures the general trend in the data.
However, a limitation of this model is that it does not capture the continuity and differentiability of
the desired prediction. In addition, we need to be careful about choosing an appropriate value for the
depth of the tree to not overfit or underfit the data; here, a depth of 3 seems to be a good choice:

[Figure: decision tree fit of MEDV against LSTAT; x axis: % lower status of the population [LSTAT]]

In the next section, we will take a look at a more robust way for fitting regression trees: random
forests.


Random forest regression 


As we discussed in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, the
random forest algorithm is an ensemble technique that combines multiple decision trees. A random
forest usually has a better generalization performance than an individual decision tree due to
randomness that helps to decrease the model variance. Other advantages of random forests are that
they are less sensitive to outliers in the dataset and don't require much parameter tuning. The only
parameter in random forests that we typically need to experiment with is the number of trees in the
ensemble. The basic random forests algorithm for regression is almost identical to the random forest
algorithm for classification that we discussed in Chapter 3, A Tour of Machine Learning Classifiers
Using Scikit-learn. The only difference is that we use the MSE criterion to grow the individual
decision trees, and the predicted target variable is calculated as the average prediction over all
decision trees.
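
The averaging step can be illustrated with a heavily simplified sketch in NumPy. It is not a full random forest: each "tree" is a hypothetical depth-1 regression stump grown with the squared-error criterion on a bootstrap sample, and with a single feature there is no random feature subsampling, so the example only shows bagging and averaging on synthetic data:

```python
import numpy as np

def fit_stump(x, y):
    """Depth-1 regression tree: returns (threshold, left mean, right mean)."""
    best_sse, best_stump = np.inf, None
    for t in np.unique(x)[:-1]:  # skip the maximum so the right side is non-empty
        left, right = y[x <= t], y[x > t]
        sse = (np.sum((left - left.mean()) ** 2)
               + np.sum((right - right.mean()) ** 2))
        if sse < best_sse:
            best_sse, best_stump = sse, (t, left.mean(), right.mean())
    return best_stump

def predict_stump(stump, x):
    t, left_mean, right_mean = stump
    return np.where(x <= t, left_mean, right_mean)

# Synthetic nonlinear data
rng = np.random.RandomState(1)
x = rng.uniform(0.0, 10.0, size=200)
y = np.sin(x) + rng.normal(scale=0.2, size=200)

# Grow each stump on a bootstrap sample (drawn with replacement)
stumps = []
for _ in range(50):
    idx = rng.randint(0, len(x), size=len(x))
    stumps.append(fit_stump(x[idx], y[idx]))

# The ensemble prediction is the average over all individual predictions
y_ensemble = np.mean([predict_stump(s, x) for s in stumps], axis=0)

mse_individual = [np.mean((y - predict_stump(s, x)) ** 2) for s in stumps]
mse_ensemble = np.mean((y - y_ensemble) ** 2)
print('Mean MSE of single stumps: %.3f' % np.mean(mse_individual))
print('MSE of the averaged ensemble: %.3f' % mse_ensemble)
```

Because the squared error is convex, the MSE of the averaged prediction can never exceed the mean MSE of the individual stumps, which is one way to see why averaging helps.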


Now, let's use all the features in the Housing Dataset to fit a random forest regression model on 60 
percent of the samples and evaluate its performance on the remaining 40 percent. The code is as 
follows: 


>>> X = df.iloc[:, :-1].values
>>> y = df['MEDV'].values
>>> X_train, X_test, y_train, y_test =\
...       train_test_split(X, y,
...                        test_size=0.4,
...                        random_state=1)


>>> from sklearn.ensemble import RandomForestRegressor
>>> # later scikit-learn releases call this criterion 'squared_error'
>>> forest = RandomForestRegressor(
...                 n_estimators=1000,
...                 criterion='mse',
...                 random_state=1,
...                 n_jobs=-1)
>>> forest.fit(X_train, y_train)
>>> y_train_pred = forest.predict(X_train)
>>> y_test_pred = forest.predict(X_test)
>>> print('MSE train: %.3f, test: %.3f' % (
...         mean_squared_error(y_train, y_train_pred),
...         mean_squared_error(y_test, y_test_pred)))
>>> print('R^2 train: %.3f, test: %.3f' % (
...         r2_score(y_train, y_train_pred),
...         r2_score(y_test, y_test_pred)))
MSE train: 1.642, test: 11.635
R^2 train: 0.960, test: 0.871

Unfortunately, we see that the random forest tends to overfit the training data. However, it's still able 
to explain the relationship between the target and explanatory variables relatively well ($R^2 = 0.871$ 
on the test dataset). 
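As a reminder of what the two reported metrics measure, here is a small sketch that computes them from their definitions with NumPy. The toy target and prediction values are assumptions chosen so that the arithmetic is easy to follow by hand:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy values (assumed for illustration only).
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 10.0]

print('MSE: %.3f' % mse(y_true, y_pred))
print('R^2: %.3f' % r2(y_true, y_pred))
```

Here the squared residuals sum to 1.5 and the total sum of squares is 20, so the MSE is 1.5 / 4 and $R^2$ is 1 - 1.5 / 20.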


Lastly, let's also take a look at the residuals of the prediction: 


>>> plt.scatter(y_train_pred,
...             y_train_pred - y_train,
...             c='black',
...             marker='o',
...             s=35,
...             alpha=0.5,
...             label='Training data')
>>> plt.scatter(y_test_pred,
...             y_test_pred - y_test,
...             c='lightgreen',
...             marker='s',
...             s=35,
...             alpha=0.7,
...             label='Test data')
>>> plt.xlabel('Predicted values')
>>> plt.ylabel('Residuals')
>>> plt.legend(loc='upper left')
>>> plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color='red')
>>> plt.xlim([-10, 50])
>>> plt.show()


As it was already summarized by the $R^2$ coefficient, we can see that the model fits the training data 
better than the test data, as indicated by the outliers in the y axis direction. Also, the distribution of the 
residuals does not seem to be completely random around the zero center point, indicating that the 
model is not able to capture all the explanatory information. However, the residual plot indicates a 
large improvement over the residual plot of the linear model that we plotted earlier in this chapter: 


[Figure: residual plot of the random forest model; x-axis: Predicted values, y-axis: Residuals]


Note 


In Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, we also discussed the 
kernel trick that can be used in combination with a support vector machine (SVM) for classification, 
which is useful if we are dealing with nonlinear problems. Although a discussion is beyond the 
scope of this book, SVMs can also be used in nonlinear regression tasks. The interested reader can 
find more information about Support Vector Machines for regression in an excellent report by S. R. 
Gunn: S. R. Gunn et al. Support Vector Machines for Classification and Regression. (ISIS technical 
report, 14, 1998). An SVM regressor is also implemented in scikit-learn, and more information about 
its usage can be found at 
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR. 
Summary 


At the beginning of this chapter, you learned about using simple linear regression analysis to model 
the relationship between a single explanatory variable and a continuous response variable. We then 
discussed a useful exploratory data analysis technique to look at patterns and anomalies in data, 
which is an important first step in predictive modeling tasks. 


We built our first model by implementing linear regression using a gradient-based optimization 
approach. We then saw how to utilize scikit-learn's linear models for regression and also implement a 
robust regression technique (RANSAC) as an approach for dealing with outliers. To assess the 
predictive performance of regression models, we computed the mean sum of squared errors and the 
related $R^2$ metric. Furthermore, we also discussed a useful graphical approach to diagnose the 
problems of regression models: the residual plot. 


After we discussed how regularization can be applied to regression models to reduce the model 
complexity and avoid overfitting, we also introduced several approaches to model nonlinear 
relationships, including polynomial feature transformation and random forest regressors. 


We discussed supervised learning, classification, and regression analysis in great detail throughout 
the previous chapters. In the next chapter, we are going to discuss another interesting subfield of 
machine learning: unsupervised learning. You will learn how to use cluster analysis for finding 
hidden structures in data in the absence of target variables. 
Chapter 11. Working with Unlabeled Data – Clustering Analysis 


In the previous chapters, we used supervised learning techniques to build machine learning models 
using data where the answer was already known; that is, the class labels were already available in our 
training data. In this chapter, we will switch gears and explore cluster analysis, a category of 
unsupervised learning techniques that allows us to discover hidden structures in data where we do 
not know the right answer upfront. The goal of clustering is to find a natural grouping in data such that 
items in the same cluster are more similar to each other than those from different clusters. 


Given its exploratory nature, clustering is an exciting topic and, in this chapter, you will learn about 
the following concepts that can help you to organize data into meaningful structures: 


• Finding centers of similarity using the popular k-means algorithm 
• Using a bottom-up approach to build hierarchical cluster trees 
• Identifying arbitrary shapes of objects using a density-based clustering approach 
Grouping objects by similarity using k-means 


In this section, we will discuss one of the most popular clustering algorithms, k-means, which is 
widely used in academia as well as in industry. Clustering (or cluster analysis) is a technique that 
allows us to find groups of similar objects, objects that are more related to each other than to objects 
in other groups. Examples of business-oriented applications of clustering include the grouping of 
documents, music, and movies by different topics, or finding customers that share similar interests 
based on common purchase behaviors as a basis for recommendation engines. 


As we will see in a moment, the k-means algorithm is extremely easy to implement but is also 
computationally very efficient compared to other clustering algorithms, which might explain its 
popularity. The k-means algorithm belongs to the category of prototype-based clustering. We will 
discuss two other categories of clustering, hierarchical and density-based clustering, later in this 
chapter. Prototype-based clustering means that each cluster is represented by a prototype, which can 
either be the centroid (average) of similar points with continuous features, or the medoid (the most 
representative or most frequently occurring point) in the case of categorical features. While k-means 
is very good at identifying clusters of spherical shape, one of the drawbacks of this clustering 
algorithm is that we have to specify the number of clusters k a priori. An inappropriate choice for k 
can result in poor clustering performance. Later in this chapter, we will discuss the elbow method and 
silhouette plots, which are useful techniques to evaluate the quality of a clustering to help us 
determine the optimal number of clusters k. 


Although k-means clustering can be applied to data in higher dimensions, we will walk through the 
following examples using a simple two-dimensional dataset for the purpose of visualization: 


>>> from sklearn.datasets import make_blobs
>>> X, y = make_blobs(n_samples=150,
...                   n_features=2,
...                   centers=3,
...                   cluster_std=0.5,
...                   shuffle=True,
...                   random_state=0)
>>> import matplotlib.pyplot as plt
>>> plt.scatter(X[:,0],
...             X[:,1],
...             c='white',
...             marker='o',
...             s=50)
>>> plt.grid()
>>> plt.show()


The dataset that we just created consists of 150 randomly generated points that are roughly grouped 
into three regions with higher density, which is visualized via a two-dimensional scatterplot: 
In real-world applications of clustering, we do not have any ground truth category information about 
those samples; otherwise, it would fall into the category of supervised learning. Thus, our goal is to 
group the samples based on their feature similarities, which can be achieved using the k-means 
algorithm that can be summarized by the following four steps: 


1. Randomly pick k centroids from the sample points as initial cluster centers. 
2. Assign each sample to the nearest centroid $\mu^{(j)}, j \in \{1, \dots, k\}$. 
3. Move the centroids to the center of the samples that were assigned to it. 
4. Repeat steps 2 and 3 until the cluster assignments do not change or a user-defined tolerance or 
a maximum number of iterations is reached. 


Now the next question is how do we measure similarity between objects? We can define similarity as 
the opposite of distance, and a commonly used distance for clustering samples with continuous 
features is the squared Euclidean distance between two points x and y in m-dimensional space: 


$$d(x, y)^2 = \sum_{j=1}^{m} (x_j - y_j)^2 = \| x - y \|_2^2$$


Note that, in the preceding equation, the index j refers to the jth dimension (feature column) of the 
sample points x and y. In the rest of this section, we will use the superscripts i and j to refer to the 
sample index and cluster index, respectively. 


Based on this Euclidean distance metric, we can describe the k-means algorithm as a simple 
optimization problem, an iterative approach for minimizing the within-cluster sum of squared errors 
(SSE), which is sometimes also called cluster inertia: 


$$SSE = \sum_{i=1}^{n} \sum_{j=1}^{k} w^{(i,j)} \left\| x^{(i)} - \mu^{(j)} \right\|_2^2$$


Here, $\mu^{(j)}$ is the representative point (centroid) for cluster j, and $w^{(i,j)} = 1$ if the sample $x^{(i)}$ is in 
cluster j; $w^{(i,j)} = 0$ otherwise. 
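The four steps and the SSE objective can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the scikit-learn algorithm: the function name kmeans_sketch, the toy data, and the rule that an empty cluster keeps its previous centroid are all choices made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    sse_history = []
    for _ in range(max_iter):
        # Step 2: squared Euclidean distance of each sample to each centroid,
        # then assign every sample to its nearest centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned samples
        # (assumption: an empty cluster keeps its previous centroid).
        for j in range(k):
            members = X[new_labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
        # Within-cluster SSE for the current assignment and centroids.
        sse_history.append(((X - centroids[new_labels]) ** 2).sum())
        # Step 4: stop once the assignments no longer change.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids, sse_history

# Toy data (assumed): three Gaussian blobs of 50 points each.
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(loc, 0.5, size=(50, 2))
                   for loc in ([0, 0], [5, 5], [-5, 5])])
labels, centroids, sse_history = kmeans_sketch(X_toy, k=3)
print(sse_history)
```

Because both the assignment step and the centroid update can only lower the SSE (or leave it unchanged), the recorded values never increase from one iteration to the next. A poor random initialization can still end in a local optimum, which motivates the repeated runs discussed next.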


Now that you have learned how the simple k-means algorithm works, let's apply it to our sample 
dataset using the KMeans class from scikit-learn's cluster module: 


>>> from sklearn.cluster import KMeans 
>>> km = KMeans(n_clusters=3,
...             init='random',
...             n_init=10,
...             max_iter=300,
...             tol=1e-04,
...             random_state=0)
>>> y_km = km.fit_predict(X)


Using the preceding code, we set the number of desired clusters to 3; specifying the number of 
clusters a priori is one of the limitations of k-means. We set n_init=10 to run the k-means clustering 
algorithm 10 times independently with different random centroids to choose the final model as the 
one with the lowest SSE. Via the max_iter parameter, we specify the maximum number of iterations 
for each single run (here, 300). Note that the k-means implementation in scikit-learn stops early if it 
converges before the maximum number of iterations is reached. 


However, it is possible that k-means does not reach convergence for a particular run, which can be 
problematic (computationally expensive) if we choose relatively large values for max_iter. One 
way to deal with convergence problems is to choose larger values for tol, which is a parameter that 
controls the tolerance with regard to the changes in the within-cluster sum-squared-error to declare 
convergence. In the preceding code, we chose a tolerance of 1e-04 (=0.0001). 
K-means++ 


So far, we discussed the classic k-means algorithm that uses a random seed to place the initial 
centroids, which can sometimes result in bad clusterings or slow convergence if the initial centroids 
are chosen poorly. One way to address this issue is to run the k-means algorithm multiple times on a 
dataset and choose the best performing model in terms of the SSE. Another strategy is to place the 
initial centroids far away from each other via the k-means++ algorithm, which leads to better and 
more consistent results than the classic k-means (D. Arthur and S. Vassilvitskii. k-means++: The 
Advantages of Careful Seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on 
Discrete algorithms, pages 1027-1035. Society for Industrial and Applied Mathematics, 2007). 


The initialization in k-means++ can be summarized as follows: 


1. Initialize an empty set M to store the k centroids being selected. 
2. Randomly choose the first centroid $\mu^{(j)}$ from the input samples and assign it to M. 
3. For each sample $x^{(i)}$ that is not in M, find the minimum squared distance $d(x^{(i)}, M)^2$ to any of 
the centroids in M. 
4. To randomly select the next centroid $\mu^{(p)}$, use a weighted probability distribution equal to 
$\frac{d(\mu^{(p)}, M)^2}{\sum_i d(x^{(i)}, M)^2}$. 
5. Repeat steps 3 and 4 until k centroids are chosen. 
6. Proceed with the classic k-means algorithm. 
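The seeding procedure above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions (the function name kmeans_pp_init and the toy data are hypothetical); it covers only the selection of the initial centroids, not the k-means iterations that follow:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1 and 2: start M with one uniformly chosen sample (stored by index).
    chosen = [int(rng.integers(len(X)))]
    while len(chosen) < k:
        # Step 3: minimum squared distance of every sample to the set M.
        diffs = X[:, None, :] - X[chosen][None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Step 4: sample the next centroid with probability proportional
        # to d2; already chosen samples have d2 = 0 and cannot be redrawn.
        probs = d2 / d2.sum()
        chosen.append(int(rng.choice(len(X), p=probs)))
    return X[chosen]

# Toy data (assumed): 150 random two-dimensional points.
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(150, 2))
init_centroids = kmeans_pp_init(X_toy, k=3)
print(init_centroids)
```

Samples far from all current centroids receive a high selection probability, which tends to spread the initial centroids across the data.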


Note 


To use k-means++ with scikit-learn's KMeans object, we just need to set the init parameter to 
k-means++ (the default setting) instead of random. 


Another problem with k-means is that one or more clusters can be empty. Note that this problem does 
not exist for k-medoids or fuzzy C-means, an algorithm that we will discuss in the next subsection. 
However, this problem is accounted for in the current k-means implementation in scikit-learn. If a 
cluster is empty, the algorithm will search for the sample that is farthest away from the centroid of the 
empty cluster. Then it will reassign the centroid to be this farthest point. 


Note 


When we are applying k-means to real-world data using a Euclidean distance metric, we want to 
make sure that the features are measured on the same scale and apply z-score standardization or 
min-max scaling if necessary. 
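Both rescaling options are one-liners in NumPy. The sketch below uses a small made-up feature matrix (an assumption for illustration) whose second column is on a much larger scale than the first, so that it would dominate any Euclidean distance if left unscaled:

```python
import numpy as np

# Toy feature matrix (assumed): column 2 is ~1000x larger than column 1.
X_raw = np.array([[1.0, 1000.0],
                  [2.0, 3000.0],
                  [3.0, 5000.0]])

# z-score standardization: zero mean and unit variance per feature column.
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# Min-max scaling: every feature column is mapped onto the range [0, 1].
col_min, col_max = X_raw.min(axis=0), X_raw.max(axis=0)
X_minmax = (X_raw - col_min) / (col_max - col_min)

print(X_std)
print(X_minmax)
```

After either transformation both columns contribute comparably to the distance computation.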
After we predicted the cluster labels y_km and discussed the challenges of the k-means algorithm, 
let's now visualize the clusters that k-means identified in the dataset together with the cluster 
centroids. These are stored under the cluster_centers_ attribute of the fitted KMeans object: 


>>> plt.scatter(X[y_km==0,0],
...             X[y_km==0,1],
...             s=50,
...             c='lightgreen',
...             marker='s',
...             label='cluster 1')
>>> plt.scatter(X[y_km==1,0],
...             X[y_km==1,1],
...             s=50,
...             c='orange',
...             marker='o',
...             label='cluster 2')
>>> plt.scatter(X[y_km==2,0],
...             X[y_km==2,1],
...             s=50,
...             c='lightblue',
...             marker='v',
...             label='cluster 3')
>>> plt.scatter(km.cluster_centers_[:,0],
...             km.cluster_centers_[:,1],
...             s=250,
...             marker='*',
...             c='red',
...             label='centroids')
>>> plt.legend()
>>> plt.grid()
>>> plt.show()


In the following scatterplot, we can see that k-means placed the three centroids at the center of each 
sphere, which looks like a reasonable grouping given this dataset: 
[Figure: scatterplot of the clusters found by k-means (cluster 1, cluster 2, cluster 3) with the centroids marked]





Although k-means worked well on this toy dataset, we need to note some of the main challenges of 
k-means. One of the drawbacks of k-means is that we have to specify the number of clusters k a priori, 
which may not always be so obvious in real-world applications, especially if we are working with a 
higher dimensional dataset that cannot be visualized. The other properties of k-means are that clusters 
do not overlap and are not hierarchical, and we also assume that there is at least one item in each 
cluster. 
Hard versus soft clustering 


Hard clustering describes a family of algorithms where each sample in a dataset is assigned to 
exactly one cluster, as in the k-means algorithm that we discussed in the previous subsection. In 
contrast, algorithms for soft clustering (sometimes also called fuzzy clustering) assign a sample to 
one or more clusters. A popular example of soft clustering is the fuzzy C-means (FCM) algorithm 
(also called soft k-means or fuzzy k-means). The original idea goes back to the 1970s where 
Joseph C. Dunn first proposed an early version of fuzzy clustering to improve k-means (J. C. Dunn. A 
Fuzzy Relative of the Isodata Process and its Use in Detecting Compact Well-separated Clusters. 
1973). Almost a decade later, James C. Bezdek published his work on the improvements of the fuzzy 
clustering algorithm, which is now known as the FCM algorithm (J. C. Bezdek. Pattern Recognition 
with Fuzzy Objective Function Algorithms. Springer Science & Business Media, 2013). 


The FCM procedure is very similar to k-means. However, we replace the hard cluster assignment by 
probabilities for each point belonging to each cluster. In k-means, we could express the cluster 
membership of a sample x by a sparse vector of binary values: 


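$$\begin{bmatrix} \mu^{(1)} \rightarrow 0 \\ \mu^{(2)} \rightarrow 1 \\ \mu^{(3)} \rightarrow 0 \end{bmatrix}$$


Here, the index position with value 1 indicates the cluster centroid $\mu^{(j)}$ the sample is assigned to 
(assuming $k = 3, j \in \{1, 2, 3\}$). In contrast, a membership vector in FCM could be represented as 
follows: 


$$\begin{bmatrix} \mu^{(1)} \rightarrow 0.1 \\ \mu^{(2)} \rightarrow 0.85 \\ \mu^{(3)} \rightarrow 0.05 \end{bmatrix}$$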

Here, each value falls in the range [0, 1] and represents a probability of membership to the respective 
cluster centroid. The sum of the memberships for a given sample is equal to 1. Similarly to the 
k-means algorithm, we can summarize the FCM algorithm in four key steps: 


1. Specify the number of k centroids and randomly assign the cluster memberships for each point. 
2. Compute the cluster centroids $\mu^{(j)}, j \in \{1, \dots, k\}$. 
3. Update the cluster memberships for each point. 
4. Repeat steps 2 and 3 until the membership coefficients do not change or a user-defined tolerance 
or a maximum number of iterations is reached. 


The objective function of FCM (we abbreviate it by $J_m$) looks very similar to the within-cluster 
sum-squared-error that we minimize in k-means: 


$$J_m = \sum_{i=1}^{n} \sum_{j=1}^{k} \left( w^{(i,j)} \right)^m \left\| x^{(i)} - \mu^{(j)} \right\|_2^2, \quad m \in [1, \infty)$$


However, note that the membership indicator $w^{(i,j)}$ is not a binary value as in k-means 
($w^{(i,j)} \in \{0, 1\}$) but a real value that denotes the cluster membership probability 
($w^{(i,j)} \in [0, 1]$). You also may have noticed that we added an additional exponent to $w^{(i,j)}$; 
the exponent m, any number greater or equal to 1 (typically m = 2), is the so-called fuzziness 
coefficient (or simply fuzzifier) that controls the degree of fuzziness. The larger the value of m, the 
smaller the cluster membership $w^{(i,j)}$ becomes, which leads to fuzzier clusters. The cluster 
membership probability itself is calculated as follows (for m > 1): 


$$w^{(i,j)} = \left[ \sum_{p=1}^{k} \left( \frac{\left\| x^{(i)} - \mu^{(j)} \right\|_2}{\left\| x^{(i)} - \mu^{(p)} \right\|_2} \right)^{\frac{2}{m-1}} \right]^{-1}$$
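For example, if we chose three cluster centers as in the previous k-means example, we could 
calculate the membership of the $x^{(i)}$ sample belonging to the $\mu^{(j)}$ cluster as: 


$$w^{(i,j)} = \left[ \left( \frac{\left\| x^{(i)} - \mu^{(j)} \right\|_2}{\left\| x^{(i)} - \mu^{(1)} \right\|_2} \right)^{\frac{2}{m-1}} + \left( \frac{\left\| x^{(i)} - \mu^{(j)} \right\|_2}{\left\| x^{(i)} - \mu^{(2)} \right\|_2} \right)^{\frac{2}{m-1}} + \left( \frac{\left\| x^{(i)} - \mu^{(j)} \right\|_2}{\left\| x^{(i)} - \mu^{(3)} \right\|_2} \right)^{\frac{2}{m-1}} \right]^{-1}$$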
The center $\mu^{(j)}$ of a cluster itself is calculated as the mean of all samples in the cluster weighted by 
the membership degree of belonging to its own cluster: 


$$\mu^{(j)} = \frac{\sum_{i=1}^{n} \left( w^{(i,j)} \right)^m x^{(i)}}{\sum_{i=1}^{n} \left( w^{(i,j)} \right)^m}$$

Just by looking at the equation to calculate the cluster memberships, it is intuitive to say that each 
iteration in FCM is more expensive than an iteration in k-means. However, FCM typically requires 
fewer iterations overall to reach convergence. Unfortunately, the FCM algorithm is currently not 
implemented in scikit-learn. However, it has been found in practice that both k-means and FCM 
produce very similar clustering outputs, as described in a study by Soumi Ghosh and Sanjay K. Dubey 
(S. Ghosh and S. K. Dubey. Comparative Analysis of k-means and Fuzzy c-means Algorithms. 
IJACSA, 4:35-38, 2013). 
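Since scikit-learn does not provide FCM, the following NumPy sketch may help to make the four steps concrete. It is an illustrative implementation under stated assumptions: the function name fcm_sketch, the toy data, the small constant that guards against division by zero, and the requirement m > 1 are choices made for this example rather than part of any library.

```python
import numpy as np

def fcm_sketch(X, k, m=2.0, max_iter=100, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: random memberships, each row normalized to sum to 1.
    W = rng.random((len(X), k))
    W /= W.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        # Step 2: centroids as means of all samples weighted by w^m.
        Wm = W ** m
        centroids = (Wm.T @ X) / Wm.sum(axis=0)[:, None]
        # Step 3: update the memberships from the distances to all centroids.
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        dist = np.maximum(dist, 1e-12)  # assumption: avoid division by zero
        ratio = dist[:, :, None] / dist[:, None, :]  # ratio[i, j, p] = d_ij / d_ip
        W_new = 1.0 / (ratio ** (2.0 / (m - 1.0))).sum(axis=2)
        # Step 4: stop once the memberships barely change.
        converged = np.abs(W_new - W).max() < tol
        W = W_new
        if converged:
            break
    return W, centroids

# Toy data (assumed): two Gaussian blobs of 40 points each.
rng = np.random.default_rng(3)
X_toy = np.vstack([rng.normal([0, 0], 0.5, size=(40, 2)),
                   rng.normal([4, 4], 0.5, size=(40, 2))])
W, centroids = fcm_sketch(X_toy, k=2)
print(W[:3])
```

A hard clustering comparable to the k-means output can be obtained by taking W.argmax(axis=1), that is, by assigning each sample to the cluster with its largest membership.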
Using the elbow method to find the optimal number of clusters 


One of the main challenges in unsupervised learning is that we do not know the definitive answer. We 
don't have the ground truth class labels in our dataset that allow us to apply the techniques that we 
used in Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter Tuning, in 
order to evaluate the performance of a supervised model. Thus, in order to quantify the quality of 
clustering, we need to use intrinsic metrics, such as the within-cluster SSE (distortion) that we 
discussed earlier in this chapter, to compare the performance of different k-means clusterings. 
Conveniently, we don't need to compute the within-cluster SSE explicitly as it is already accessible 
via the inertia_ attribute after fitting a KMeans model: 


>>> print('Distortion: %.2f' % km.inertia_)
Distortion: 72.46


Based on the within-cluster SSE, we can use a graphical tool, the so-called elbow method, to 
estimate the optimal number of clusters k for a given task. Intuitively, we can say that, if k increases, 
the distortion will decrease. This is because the samples will be closer to the centroids they are 
assigned to. The idea behind the elbow method is to identify the value of k beyond which the 
distortion no longer decreases rapidly (the "elbow" of the curve), which will become more clear if 
we plot distortion for different values of k: 


>>> distortions = []
>>> for i in range(1, 11):
...     km = KMeans(n_clusters=i,
...                 init='k-means++',
...                 n_init=10,
...                 max_iter=300,
...                 random_state=0)
...     km.fit(X)
...     distortions.append(km.inertia_)
>>> plt.plot(range(1,11), distortions, marker='o')
>>> plt.xlabel('Number of clusters')
>>> plt.ylabel('Distortion')
>>> plt.show()


As we can see in the following plot, the elbow is located at k = 3, which provides evidence that k = 3 
is indeed a good choice for this dataset: 
Quantifying the quality of clustering via silhouette plots 


Another intrinsic metric to evaluate the quality of a clustering is silhouette analysis, which can also 
be applied to clustering algorithms other than k-means that we will discuss later in this chapter. 
Silhouette analysis can be used as a graphical tool to plot a measure of how tightly grouped the 
samples in the clusters are. To calculate the silhouette coefficient of a single sample in our dataset, 
we can apply the following three steps: 


1. Calculate the cluster cohesion $a^{(i)}$ as the average distance between a sample $x^{(i)}$ and all other 
points in the same cluster. 
2. Calculate the cluster separation $b^{(i)}$ from the next closest cluster as the average distance 
between the sample $x^{(i)}$ and all samples in the nearest cluster. 
3. Calculate the silhouette $s^{(i)}$ as the difference between cluster cohesion and separation divided 
by the greater of the two, as shown here: 


$$s^{(i)} = \frac{b^{(i)} - a^{(i)}}{\max \left\{ b^{(i)}, a^{(i)} \right\}}$$

The silhouette coefficient is bounded in the range -1 to 1. Based on the preceding formula, we can see 
that the silhouette coefficient is 0 if the cluster separation and cohesion are equal ($b^{(i)} = a^{(i)}$). 
Furthermore, we get close to an ideal silhouette coefficient of 1 if $b^{(i)} \gg a^{(i)}$, since $b^{(i)}$ quantifies 
how dissimilar a sample is to other clusters, and $a^{(i)}$ tells us how similar it is to the other samples in 
its own cluster, respectively. 
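The three steps translate almost literally into NumPy. The sketch below is for illustration only; the function name silhouette_sketch and the one-dimensional toy data are assumptions, and samples in singleton clusters are given a coefficient of 0 by convention:

```python
import numpy as np

def silhouette_sketch(X, labels):
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the sample itself
        if not same.any():
            continue  # singleton cluster: leave the coefficient at 0
        # Step 1: cohesion a, mean distance to the sample's own cluster.
        a = dist[i, same].mean()
        # Step 2: separation b, mean distance to the nearest other cluster.
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        # Step 3: silhouette coefficient of sample i.
        scores[i] = (b - a) / max(a, b)
    return scores

# Toy data (assumed): two tight pairs of points on a line.
X_toy = [[0.0], [1.0], [10.0], [11.0]]
labels_toy = [0, 0, 1, 1]
scores = silhouette_sketch(X_toy, labels_toy)
print(scores)
```

For the first sample, the cohesion is a = 1 and the separation is b = (10 + 11) / 2 = 10.5, which gives a coefficient of 9.5 / 10.5, close to the ideal value of 1.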


The silhouette coefficient is available as silhouette_samples from scikit-learn's metrics module, 
and optionally the silhouette_score function can be imported. The latter calculates the average silhouette 
coefficient across all samples, which is equivalent to numpy.mean(silhouette_samples(...)). By 
executing the following code, we will now create a plot of the silhouette coefficients for a k-means 
clustering with k = 3: 


>>> km = KMeans(n_clusters=3,
...             init='k-means++',
...             n_init=10,
...             max_iter=300,
...             tol=1e-04,
...             random_state=0)
>>> y_km = km.fit_predict(X)

>>> import numpy as np
>>> from matplotlib import cm
>>> from sklearn.metrics import silhouette_samples
>>> cluster_labels = np.unique(y_km)
>>> n_clusters = cluster_labels.shape[0]
>>> silhouette_vals = silhouette_samples(X,
...                                      y_km,
...                                      metric='euclidean')
>>> y_ax_lower, y_ax_upper = 0, 0
>>> yticks = []
>>> for i, c in enumerate(cluster_labels):
...     c_silhouette_vals = silhouette_vals[y_km == c]
...     c_silhouette_vals.sort()
...     y_ax_upper += len(c_silhouette_vals)
...     color = cm.jet(i / n_clusters)
...     plt.barh(range(y_ax_lower, y_ax_upper),
...              c_silhouette_vals,
...              height=1.0,
...              edgecolor='none',
...              color=color)
...     yticks.append((y_ax_lower + y_ax_upper) / 2)
...     y_ax_lower += len(c_silhouette_vals)
>>> silhouette_avg = np.mean(silhouette_vals)
>>> plt.axvline(silhouette_avg,
...             color="red",
...             linestyle="--")
>>> plt.yticks(yticks, cluster_labels + 1)
>>> plt.ylabel('Cluster')
>>> plt.xlabel('Silhouette coefficient')
>>> plt.show()


Through a visual inspection of the silhouette plot, we can quickly scrutinize the sizes of the different 
clusters and identify clusters that contain outliers: 


[Figure: silhouette plot for k = 3; x-axis: Silhouette coefficient, y-axis: Cluster]
As we can see in the preceding silhouette plot, our silhouette coefficients are not even close to 0, 
which can be an indicator of a good clustering. Furthermore, to summarize the goodness of our 
clustering, we added the average silhouette coefficient to the plot (dotted line). 


To see how a silhouette plot looks for a relatively bad clustering, let's seed the k-means algorithm 
with two centroids only: 


>>> km = KMeans(n_clusters=2,
...             init='k-means++',
...             n_init=10,
...             max_iter=300,
...             tol=1e-04,
...             random_state=0)
>>> y_km = km.fit_predict(X)
>>> plt.scatter(X[y_km==0,0],
...             X[y_km==0,1],
...             s=50, c='lightgreen',
...             marker='s',
...             label='cluster 1')
>>> plt.scatter(X[y_km==1,0],
...             X[y_km==1,1],
...             s=50,
...             c='orange',
...             marker='o',
...             label='cluster 2')
>>> plt.scatter(km.cluster_centers_[:,0],
...             km.cluster_centers_[:,1],
...             s=250,
...             marker='*',
...             c='red',
...             label='centroids')
>>> plt.legend()
>>> plt.grid()
>>> plt.show()


As we can see in the following scatterplot, one of the centroids falls between two of the three 
spherical groupings of the sample points. Although the clustering does not look completely terrible, it 
is suboptimal. 
[Figure: scatterplot of the k-means clustering with two centroids]





Next we create the silhouette plot to evaluate the results. Please keep in mind that we typically do not 
have the luxury of visualizing datasets in two-dimensional scatterplots in real-world problems, since 
we typically work with data in higher dimensions: 


>>> cluster_labels = np.unique(y_km)
>>> n_clusters = cluster_labels.shape[0]
>>> silhouette_vals = silhouette_samples(X,
...                                      y_km,
...                                      metric='euclidean')
>>> y_ax_lower, y_ax_upper = 0, 0
>>> yticks = []
>>> for i, c in enumerate(cluster_labels):
...     c_silhouette_vals = silhouette_vals[y_km == c]
...     c_silhouette_vals.sort()
...     y_ax_upper += len(c_silhouette_vals)
...     color = cm.jet(i / n_clusters)
...     plt.barh(range(y_ax_lower, y_ax_upper),
...              c_silhouette_vals,
...              height=1.0,
...              edgecolor='none',
...              color=color)
...     yticks.append((y_ax_lower + y_ax_upper) / 2)
...     y_ax_lower += len(c_silhouette_vals)
>>> silhouette_avg = np.mean(silhouette_vals)
>>> plt.axvline(silhouette_avg, color="red", linestyle="--")
>>> plt.yticks(yticks, cluster_labels + 1)
>>> plt.ylabel('Cluster')
>>> plt.xlabel('Silhouette coefficient')
>>> plt.show()


As we can see in the resulting plot, the silhouettes now have visibly different lengths and widths, 
which yields further evidence for a suboptimal clustering: 


[Figure: silhouette plot for k = 2; x-axis: Silhouette coefficient, y-axis: Cluster]
Organizing clusters as a hierarchical tree 


In this section, we will take a look at an alternative approach to prototype-based clustering: 
hierarchical clustering. One advantage of hierarchical clustering algorithms is that they allow us to 
plot dendrograms (visualizations of a binary hierarchical clustering), which can help with the 
interpretation of the results by creating meaningful taxonomies. Another useful advantage of this 
hierarchical approach is that we do not need to specify the number of clusters upfront. 


The two main approaches to hierarchical clustering are agglomerative and divisive hierarchical 
clustering. In divisive hierarchical clustering, we start with one cluster that encompasses all our 
samples, and we iteratively split the cluster into smaller clusters until each cluster only contains one 
sample. In this section, we will focus on agglomerative clustering, which takes the opposite 
approach. We start with each sample as an individual cluster and merge the closest pairs of clusters 
until only one cluster remains. 


The two standard algorithms for agglomerative hierarchical clustering are single linkage and 
complete linkage. Using single linkage, we compute the distances between the most similar members 
for each pair of clusters and merge the two clusters for which the distance between the most similar 
members is the smallest. The complete linkage approach is similar to single linkage but, instead of 
comparing the most similar members in each pair of clusters, we compare the most dissimilar 
members to perform the merge. This is shown in the following diagram: 


[Figure: two clusters compared by their most similar members (single linkage) and by their most dissimilar members (complete linkage)]
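The difference between the two criteria can be illustrated with a pair of tiny clusters. In this sketch the coordinates are made up for the example; the cluster distance is the smallest pairwise distance under single linkage and the largest pairwise distance under complete linkage:

```python
import numpy as np

# Two toy clusters on the x axis (assumed values).
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[3.0, 0.0], [6.0, 0.0]])

# All pairwise Euclidean distances between members of the two clusters.
pairwise = np.linalg.norm(cluster_a[:, None, :] - cluster_b[None, :, :],
                          axis=2)

single_link = pairwise.min()    # distance of the most similar members
complete_link = pairwise.max()  # distance of the most dissimilar members

print(pairwise)
print(single_link, complete_link)
```

The pairwise distances here are 3, 6, 2, and 5, so single linkage measures the cluster distance as 2, whereas complete linkage measures it as 6.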





Note 


Other commonly used algorithms for agglomerative hierarchical clustering include average linkage 
and Ward's linkage. In average linkage, we merge the cluster pairs based on the minimum average 
distances between all group members in the two clusters. In Ward's method, those two clusters that 
lead to the minimum increase of the total within-cluster SSE are merged. 
In this section, we will focus on agglomerative clustering using the complete linkage approach. This 
is an iterative procedure that can be summarized by the following steps: 


1. Compute the distance matrix of all samples. 
2. Represent each data point as a singleton cluster. 
3. Merge the two closest clusters based on the distance of the most dissimilar (distant) members. 
4. Update the similarity matrix. 
5. Repeat steps 2 to 4 until one single cluster remains. 
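To make the procedure tangible, here is a deliberately naive sketch of complete linkage agglomeration in NumPy. The function name complete_linkage_sketch and the one-dimensional toy points are assumptions for illustration; instead of maintaining an updated cluster-level matrix, this version recomputes the cluster distances from the sample distance matrix in every pass, which is simple but inefficient:

```python
import numpy as np

def complete_linkage_sketch(X):
    X = np.asarray(X, dtype=float)
    # Step 1: distance matrix of all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Step 2: every sample starts as its own (singleton) cluster.
    clusters = [[i] for i in range(len(X))]
    merges = []
    while len(clusters) > 1:
        # Step 3: find the pair of clusters whose most distant
        # members are closest to each other.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].max()
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        # Step 4 is implicit: cluster distances are recomputed from
        # the sample distances, so only the cluster list is updated.
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges

# Toy data (assumed): five points on a line.
merges = complete_linkage_sketch([[0.0], [1.0], [5.0], [6.0], [20.0]])
heights = [d for _, _, d in merges]
print(heights)
```

For these points, the two close pairs are merged first at a distance of 1 each, the resulting clusters are joined at 6 (the distance between 0 and 6), and the remaining point at 20 is merged last.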


Now we will discuss how to compute the distance matrix (step 1). But first, let's generate some 
random sample data to work with. The rows represent different observations (IDs 0 to 4), and the 
columns are the different features (X, Y, Z) of those samples: 


>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(123)
>>> variables = ['X', 'Y', 'Z']
>>> labels = ['ID_0','ID_1','ID_2','ID_3','ID_4']
>>> X = np.random.random_sample([5,3])*10
>>> df = pd.DataFrame(X, columns=variables, index=labels)
>>> df


After executing the preceding code, we should now see the following DataFrame containing the 
randomly generated samples: 


             X         Y         Z
ID_0  6.964692  2.861393  2.268515
ID_1  5.513148  7.194690  4.231065
ID_2  9.807642  6.848297  4.809319
ID_3  3.921175  3.431780  7.290497
ID_4  4.385722  0.596779  3.980443
Performing hierarchical clustering on a distance matrix 


To calculate the distance matrix as input for the hierarchical clustering algorithm, we will use the 
pdist function from SciPy's spatial.distance submodule: 


>>> from scipy.spatial.distance import pdist, squareform
>>> row_dist = pd.DataFrame(squareform(
...                pdist(df, metric='euclidean')),
...                columns=labels, index=labels)
>>> row_dist


Using the preceding code, we calculated the Euclidean distance between each pair of sample points 
in our dataset based on the features X, Y, and Z. We provided the condensed distance matrix 
(returned by pdist) as input to the squareform function to create a symmetrical matrix of the 
pair-wise distances, as shown here: 


          ID_0      ID_1      ID_2      ID_3      ID_4
ID_0  0.000000  4.973534  5.516653  5.899885  3.835396
ID_1  4.973534  0.000000  4.347073  5.104311  6.698233
ID_2  5.516653  4.347073  0.000000  7.244262  8.316594
ID_3  5.899885  5.104311  7.244262  0.000000  4.382864
ID_4  3.835396  6.698233  8.316594  4.382864  0.000000





Next we will apply the complete linkage agglomeration to our clusters using the linkage function 
from SciPy's cluster.hierarchy submodule, which returns a so-called linkage matrix. 


However, before we call the linkage function, let's take a careful look at the function documentation: 


>>> from scipy.cluster.hierarchy import linkage
>>> help(linkage)
[...]
Parameters:
  y : ndarray
    A condensed or redundant distance matrix. A condensed
    distance matrix is a flat array containing the upper
    triangular of the distance matrix. This is the form
    that pdist returns. Alternatively, a collection of m
    observation vectors in n dimensions may be passed as
    an m by n array.

  method : str, optional
    The linkage algorithm to use. See the Linkage Methods
    section below for full descriptions.

  metric : str, optional
    The distance metric to use. See the distance.pdist
    function for a list of valid distance metrics.

Returns:
  Z : ndarray
    The hierarchical clustering encoded as a linkage matrix.


Based on the function description, we conclude that we can use a condensed distance matrix (upper
triangular) from the pdist function as an input argument. Alternatively, we could also provide the
initial data array and use the euclidean metric as a function argument in linkage. However, we
should not use the squareform distance matrix that we defined earlier, since it would yield different
distance values from those expected. To sum it up, the three possible scenarios are listed here:

• Incorrect approach: In this approach, we use the squareform distance matrix. The code is as
follows:

>>> from scipy.cluster.hierarchy import linkage
>>> row_clusters = linkage(row_dist,
...                        method='complete',
...                        metric='euclidean')


• Correct approach: In this approach, we use the condensed distance matrix. The code is as
follows:

>>> row_clusters = linkage(pdist(df, metric='euclidean'),
...                        method='complete')


• Correct approach: In this approach, we use the input sample matrix. The code is as follows:

>>> row_clusters = linkage(df.values,
...                        method='complete',
...                        metric='euclidean')
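The three scenarios can be checked side by side. The following sketch, which works on a plain NumPy array instead of the DataFrame and uses variable names of our own choosing, suggests that the two correct approaches produce the same linkage matrix, whereas the square matrix is treated as five observations with five features each and therefore leads to different merge distances:

```python
import warnings
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

np.random.seed(123)
X = np.random.random_sample([5, 3]) * 10

condensed = pdist(X, metric='euclidean')

# Correct: condensed distance matrix.
Z_condensed = linkage(condensed, method='complete')
# Correct: raw sample matrix plus a metric.
Z_samples = linkage(X, method='complete', metric='euclidean')

# Incorrect: the square matrix is interpreted as 5 observations with
# 5 features each, so distances between its rows are clustered instead.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # SciPy may warn about this input
    Z_square = linkage(squareform(condensed), method='complete',
                       metric='euclidean')

print(np.allclose(Z_condensed, Z_samples))
print(np.allclose(Z_condensed[:, 2], Z_square[:, 2]))
```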


To take a closer look at the clustering results, we can turn them to a pandas DataFrame (best viewed
in IPython Notebook) as follows:

>>> pd.DataFrame(row_clusters,
...      columns=['row label 1',
...               'row label 2',
...               'distance',
...               'no. of items in clust.'],
...      index=['cluster %d' %(i+1) for i in
...             range(row_clusters.shape[0])])


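As shown in the following table, the linkage matrix consists of several rows where each row
represents one merge. The first and second columns denote the two clusters that were merged in that
step (indices 0 to 4 refer to the original samples, larger indices to clusters formed in earlier
rows), and the third column reports the distance between them, which under complete linkage is the
distance between their most dissimilar members. The last column returns the count of the members in
each cluster.

          | row label 1 | row label 2 | distance | no. of items in clust.
cluster 1 |           0 |           4 | 3.835396 |                      2
cluster 2 |           1 |           2 | 4.347073 |                      2
cluster 3 |           3 |           5 | 5.899885 |                      3
cluster 4 |           6 |           7 | 8.316594 |                      5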








Now that we have computed the linkage matrix, we can visualize the results in the form of a 
dendrogram: 


>>> from scipy.cluster.hierarchy import dendrogram
# make dendrogram black (part 1/2)
# from scipy.cluster.hierarchy import set_link_color_palette
# set_link_color_palette(['black'])
>>> row_dendr = dendrogram(row_clusters,
...                        labels=labels,
...                        # make dendrogram black (part 2/2)
...                        # color_threshold=np.inf
...                        )
>>> plt.tight_layout()
>>> plt.ylabel('Euclidean distance')
>>> plt.show()


If you are executing the preceding code or reading the e-book version of this book, you will notice
that the branches in the resulting dendrogram are shown in different colors. The coloring scheme is
derived from a list of matplotlib colors that are cycled for the distance thresholds in the dendrogram.
For example, to display the dendrograms in black, you can uncomment the respective sections that I
inserted in the preceding code.

Such a dendrogram summarizes the different clusters that were formed during the agglomerative
hierarchical clustering; for example, we can see that the samples ID_0 and ID_4, followed by ID_1
and ID_2, are the most similar ones based on the Euclidean distance metric.


Attaching dendrograms to a heat map 


In practical applications, hierarchical clustering dendrograms are often used in combination with a
heat map, which allows us to represent the individual values in the sample matrix with a color code.
In this section, we will discuss how to attach a dendrogram to a heat map plot and order the rows in
the heat map correspondingly.


However, attaching a dendrogram to a heat map can be a little bit tricky, so let's go through this 
procedure step by step: 


1. We create a new figure object and define the x axis position, y axis position, width, and height
of the dendrogram via the add_axes method. Furthermore, we rotate the dendrogram 90
degrees counter-clockwise. The code is as follows:

>>> fig = plt.figure(figsize=(8,8))
>>> axd = fig.add_axes([0.09,0.1,0.2,0.6])
>>> row_dendr = dendrogram(row_clusters, orientation='right')


2. Next we reorder the data in our initial DataFrame according to the clustering labels that can be
accessed from the dendrogram object, which is essentially a Python dictionary, via the leaves
key. Since the leaves are integer positions, we select the rows by position. The code is as follows:

>>> df_rowclust = df.iloc[row_dendr['leaves'][::-1]]


3. Now we construct the heat map from the reordered DataFrame and position it right next to the 
dendrogram: 


>>> axm = fig.add_axes([0.23,0.1,0.6,0.6])
>>> cax = axm.matshow(df_rowclust,
...              interpolation='nearest', cmap='hot_r')


4. Finally we will modify the aesthetics of the heat map by removing the axis ticks and hiding the 
axis spines. Also, we will add a color bar and assign the feature and sample names to the x and 
y axis tick labels, respectively. The code is as follows: 


>>> axd.set_xticks([])
>>> axd.set_yticks([])
>>> for i in axd.spines.values():
...     i.set_visible(False)
>>> fig.colorbar(cax)
>>> axm.set_xticklabels([''] + list(df_rowclust.columns))
>>> axm.set_yticklabels([''] + list(df_rowclust.index))
>>> plt.show()


After following the previous steps, the heat map should be displayed with the dendrogram attached: 
[Figure: heat map of the sample matrix with the row dendrogram attached]

As we can see, the row order in the heat map reflects the clustering of the samples in the dendrogram.
In addition to a simple dendrogram, the color-coded values of each sample and feature in the heat 
map provide us with a nice summary of the dataset. 
Applying agglomerative clustering via scikit-learn


In this section, we saw how to perform agglomerative hierarchical clustering using SciPy. However,
there is also an AgglomerativeClustering implementation in scikit-learn, which allows us to
choose the number of clusters that we want to return. This is useful if we want to prune the
hierarchical cluster tree. By setting the n_clusters parameter to 2, we will now cluster the samples
into two groups using the same complete linkage approach based on the Euclidean distance metric as
before:

>>> from sklearn.cluster import AgglomerativeClustering
>>> # note: newer scikit-learn releases (1.2+) call the affinity parameter metric
>>> ac = AgglomerativeClustering(n_clusters=2,
...                              affinity='euclidean',
...                              linkage='complete')
>>> labels = ac.fit_predict(X)
>>> print('Cluster labels: %s' % labels)
Cluster labels: [0 1 1 0 0]


Looking at the predicted cluster labels, we can see that the first, fourth, and fifth sample (ID_0, ID_3,
and ID_4) were assigned to one cluster (0), and the samples ID_1 and ID_2 were assigned to a
second cluster (1), which is consistent with the results that we can observe in the dendrogram.
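One way to check this consistency programmatically is to cut the SciPy tree into two flat clusters and compare the grouping with the scikit-learn result. The sketch below is only an illustration under a few assumptions: it uses SciPy's fcluster function with the maxclust criterion, relies on the default Euclidean distance in AgglomerativeClustering, and introduces a small helper, co_membership, that is our own and not part of either library. Since the two libraries number their clusters differently, we compare which samples end up together rather than the raw labels:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import AgglomerativeClustering

np.random.seed(123)
X = np.random.random_sample([5, 3]) * 10

# SciPy: build the full tree, then cut it into at most 2 flat clusters.
Z = linkage(X, method='complete', metric='euclidean')
scipy_labels = fcluster(Z, t=2, criterion='maxclust')

# scikit-learn: request 2 clusters directly (Euclidean distance is
# the default, so no metric argument is passed here).
ac = AgglomerativeClustering(n_clusters=2, linkage='complete')
sklearn_labels = ac.fit_predict(X)

def co_membership(labels):
    """Boolean matrix: True where two samples share a cluster."""
    labels = np.asarray(labels)
    return labels[:, None] == labels[None, :]

same_partition = np.array_equal(co_membership(scipy_labels),
                                co_membership(sklearn_labels))
print(scipy_labels, sklearn_labels, same_partition)
```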
Locating regions of high density via DBSCAN


Although we can't cover the vast number of different clustering algorithms in this chapter, let's at least
introduce one more approach to clustering: Density-based Spatial Clustering of Applications with
Noise (DBSCAN). The notion of density in DBSCAN is defined as the number of points within a
specified radius $\varepsilon$.

In DBSCAN, a special label is assigned to each sample (point) using the following criteria:

• A point is considered a core point if at least a specified number (MinPts) of neighboring points
fall within the specified radius $\varepsilon$
• A border point is a point that has fewer neighbors than MinPts within $\varepsilon$, but lies within the
$\varepsilon$ radius of a core point
• All other points that are neither core nor border points are considered noise points

After labeling the points as core, border, or noise points, the DBSCAN algorithm can be summarized
in two simple steps:

1. Form a separate cluster for each core point or a connected group of core points (core points are
connected if they are no farther away than $\varepsilon$).
2. Assign each border point to the cluster of its corresponding core point.
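The labeling criteria above can be sketched in a few lines of NumPy. This is not how scikit-learn implements DBSCAN internally; it is a brute-force illustration with a hypothetical helper, label_points, and a made-up toy dataset. One assumption deserves emphasis: here a point counts itself as one of its own neighbors, which follows the convention of scikit-learn's min_samples parameter:

```python
import numpy as np

def label_points(X, eps, min_pts):
    """Label each row of X as 'core', 'border', or 'noise'.

    Assumption: a point counts itself as one of its own neighbors,
    which matches scikit-learn's min_samples convention.
    """
    # Pairwise Euclidean distances via broadcasting.
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    within_eps = dist <= eps

    is_core = within_eps.sum(axis=1) >= min_pts
    # Border: not core, but inside the eps-radius of some core point.
    near_core = (within_eps & is_core[None, :]).any(axis=1)
    is_border = ~is_core & near_core

    labels = np.full(X.shape[0], 'noise', dtype=object)
    labels[is_border] = 'border'
    labels[is_core] = 'core'
    return labels

# Toy data: a tight group of four points, one point on its fringe,
# and one isolated point.
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                  [0.3, 0.1], [5.0, 5.0]])
toy_labels = label_points(X_toy, eps=0.25, min_pts=4)
print(toy_labels)
```

In this toy example, the four tightly packed points are core points, the point on the fringe has too few neighbors of its own but lies within the radius of a core point, and the isolated point is noise.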


To get a better understanding of what the result of DBSCAN can look like before jumping to the
implementation, let's summarize what you have learned about core points, border points, and noise
points in the following figure:

[Figure: illustration of core points, border points, and noise points]

One of the main advantages of using DBSCAN is that it does not assume that the clusters have a
spherical shape as in k-means. Furthermore, DBSCAN is different from k-means and hierarchical
clustering in that it doesn't necessarily assign each point to a cluster but is capable of removing noise
points.


For a more illustrative example, let's create a new dataset of half-moon-shaped structures to compare 
k-means clustering, hierarchical clustering, and DBSCAN: 
>>> from sklearn.datasets import make_moons
>>> X, y = make_moons(n_samples=200,
...                   noise=0.05,
...                   random_state=0)
>>> plt.scatter(X[:,0], X[:,1])
>>> plt.show()


As we can see in the resulting plot, there are two visible, half-moon-shaped groups consisting of 100
sample points each:

[Figure: scatter plot of the two half-moon-shaped groups]





We will start by using the k-means algorithm and complete linkage clustering to see whether one of 
those previously discussed clustering algorithms can successfully identify the half-moon shapes as 
separate clusters. The code is as follows: 


>>> f, (ax1, ax2) = plt.subplots(1, 2, figsize=(8,3))
>>> km = KMeans(n_clusters=2,
...             random_state=0)
>>> y_km = km.fit_predict(X)
>>> ax1.scatter(X[y_km==0,0],
...             X[y_km==0,1],
...             c='lightblue',
...             marker='o',
...             s=40,
...             label='cluster 1')
>>> ax1.scatter(X[y_km==1,0],
...             X[y_km==1,1],
...             c='red',
...             marker='s',
...             s=40,
...             label='cluster 2')
>>> ax1.set_title('K-means clustering')
>>> ac = AgglomerativeClustering(n_clusters=2,
...                              affinity='euclidean',
...                              linkage='complete')
>>> y_ac = ac.fit_predict(X)
>>> ax2.scatter(X[y_ac==0,0],
...             X[y_ac==0,1],
...             c='lightblue',
...             marker='o',
...             s=40,
...             label='cluster 1')
>>> ax2.scatter(X[y_ac==1,0],
...             X[y_ac==1,1],
...             c='red',
...             marker='s',
...             s=40,
...             label='cluster 2')
>>> ax2.set_title('Agglomerative clustering')
>>> plt.legend()
>>> plt.show()


Based on the visualized clustering results, we can see that the k-means algorithm is unable to separate
the two clusters, and the hierarchical clustering algorithm was challenged by those complex shapes:

[Figure: two scatter plots titled "K-means clustering" and "Agglomerative clustering", each showing
cluster 1 and cluster 2]

Finally, let's try the DBSCAN algorithm on this dataset to see if it can find the two half-moon-shaped
clusters using a density-based approach:


>>> from sklearn.cluster import DBSCAN
>>> db = DBSCAN(eps=0.2,
...             min_samples=5,
...             metric='euclidean')
>>> y_db = db.fit_predict(X)
>>> plt.scatter(X[y_db==0,0],
...             X[y_db==0,1],
...             c='lightblue',
...             marker='o',
...             s=40,
...             label='cluster 1')
>>> plt.scatter(X[y_db==1,0],
...             X[y_db==1,1],
...             c='red',
...             marker='s',
...             s=40,
...             label='cluster 2')
>>> plt.legend()
>>> plt.show()


The DBSCAN algorithm can successfully detect the half-moon shapes, which highlights one of the
strengths of DBSCAN (clustering data of arbitrary shapes):

[Figure: scatter plot of the two half-moon-shaped clusters found by DBSCAN]
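Since make_moons also returns the true moon membership y, we can put a number on the visual impression. The following sketch is an addition of ours rather than part of the original workflow: it uses the adjusted Rand index from sklearn.metrics, which is only available because this synthetic dataset comes with ground truth labels, something we usually do not have in a real clustering task:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=200, noise=0.05, random_state=0)

y_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
y_db = DBSCAN(eps=0.2, min_samples=5, metric='euclidean').fit_predict(X)

# The adjusted Rand index is 1.0 for a perfect match with the true
# moon membership and close to 0.0 for a random assignment.
ari_km = adjusted_rand_score(y, y_km)
ari_db = adjusted_rand_score(y, y_db)
print('k-means ARI: %.2f' % ari_km)
print('DBSCAN ARI:  %.2f' % ari_db)

# DBSCAN marks noise points with the label -1.
n_noise = int((y_db == -1).sum())
print('DBSCAN noise points: %d' % n_noise)
```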
However, we should also note some of the disadvantages of DBSCAN. With an increasing number of
features in our dataset, given a fixed-size training set, the negative effect of the curse of
dimensionality increases. This is especially a problem if we are using the Euclidean distance metric.
However, the problem of the curse of dimensionality is not unique to DBSCAN; it also affects other
clustering algorithms that use the Euclidean distance metric, for example, the k-means and
hierarchical clustering algorithms. In addition, we have two hyperparameters in DBSCAN (MinPts
and $\varepsilon$) that need to be optimized to yield good clustering results. Finding a good combination of
MinPts and $\varepsilon$ can be problematic if the density differences in the dataset are relatively large.
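One symptom of the curse of dimensionality is that pairwise Euclidean distances become more and more alike as the number of features grows, which makes a single radius $\varepsilon$ less meaningful. The following sketch illustrates this on uniformly distributed random data; the helper relative_contrast and the chosen dimensions are our own assumptions for this demonstration:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.RandomState(0)
n_samples = 100

def relative_contrast(n_features):
    """(max - min) / min over all pairwise Euclidean distances."""
    X = rng.uniform(size=(n_samples, n_features))
    d = pdist(X, metric='euclidean')
    return (d.max() - d.min()) / d.min()

contrast = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
for dim, value in contrast.items():
    print('%4d features: relative contrast = %.2f' % (dim, value))
```

With a fixed number of samples, the relative gap between the nearest and the farthest pair of points shrinks as features are added, so it becomes harder to tell dense regions from sparse ones.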


Note 


So far, we saw three of the most fundamental categories of clustering algorithms: prototype-based
clustering with k-means, agglomerative hierarchical clustering, and density-based clustering via
DBSCAN. However, I also want to mention a fourth class of more advanced clustering algorithms
that we have not covered in this chapter: graph-based clustering. Probably the most prominent
members of the graph-based clustering family are spectral clustering algorithms. Although there are
many different implementations of spectral clustering, they all have in common that they use the
eigenvectors of a similarity matrix to derive the cluster relationships. Since spectral clustering is
beyond the scope of this book, you can read the excellent tutorial by Ulrike von Luxburg to learn more
about this topic (U. Von Luxburg. A Tutorial on Spectral Clustering. Statistics and computing,
17(4):395-416, 2007). It is freely available from arXiv at http://arxiv.org/pdf/0711.0189v1.pdf.


Note that, in practice, it is not always obvious which algorithm will perform best on a given dataset,
especially if the data comes in multiple dimensions that make it hard or impossible to visualize.
Furthermore, it is important to emphasize that a successful clustering does not only depend on the 
algorithm and its hyperparameters. Rather, the choice of an appropriate distance metric and the use of 
domain knowledge that can help guide the experimental setup can be even more important. 
Summary


In this chapter, you learned about three different clustering algorithms that can help us with the 
discovery of hidden structures or information in data. We started this chapter with a prototype-based 
approach, k-means, which clusters samples into spherical shapes based on a specified number of 
cluster centroids. Since clustering is an unsupervised method, we do not enjoy the luxury of ground 
truth labels to evaluate the performance of a model. Thus, we looked at useful intrinsic performance 
metrics such as the elbow method or silhouette analysis as an attempt to quantify the quality of 
clustering. 


We then looked at a different approach to clustering: agglomerative hierarchical clustering. 
Hierarchical clustering does not require specifying the number of clusters upfront, and the result can 
be visualized in a dendrogram representation, which can help with the interpretation of the results. 
The last clustering algorithm that we saw in this chapter was DBSCAN, an algorithm that groups 
points based on local densities and is capable of handling outliers and identifying nonglobular 
shapes. 


After this excursion into the field of unsupervised learning, it is now about time to introduce some of
the most exciting machine learning algorithms for supervised learning: multilayer artificial neural
networks. After their recent resurgence, neural networks are once again the hottest topic in machine
learning research. Thanks to the recently developed deep learning algorithms, neural networks are
considered state-of-the-art for many complex tasks such as image classification and speech
recognition. In Chapter 12, Training Artificial Neural Networks for Image Recognition, we will
construct our own multilayer neural network from scratch. In Chapter 13, Parallelizing Neural
Network Training with Theano, we will introduce powerful libraries that can help us to train
complex network architectures most efficiently.

Chapter 12. Training Artificial Neural Networks for Image Recognition


As you may know, deep learning is getting a lot of press and is without any doubt the hottest topic in
the machine learning field. Deep learning can be understood as a set of algorithms that were
developed to train artificial neural networks with many layers most efficiently. In this chapter, you
will learn the basic concepts of artificial neural networks so that you will be well equipped to further
explore the most exciting areas of research in the machine learning field, as well as the advanced
Python-based deep learning libraries that are currently being developed.


The topics that we will cover are as follows: 


• Getting a conceptual understanding of multi-layer neural networks
• Training neural networks for image classification
• Implementing the powerful backpropagation algorithm
• Debugging neural network implementations

Modeling complex functions with artificial neural networks


At the beginning of this book, we started our journey through machine learning algorithms with
artificial neurons in Chapter 2, Training Machine Learning Algorithms for Classification. Artificial
neurons represent the building blocks of the multi-layer artificial neural networks that we are going to
discuss in this chapter. The basic concept behind artificial neural networks was built upon hypotheses
and models of how the human brain works to solve complex problem tasks. Although artificial neural
networks have gained a lot of popularity in recent years, early studies of neural networks go back to
the 1940s when Warren McCulloch and Walter Pitts first described how neurons could work.
However, in the decades that followed the first implementation of the McCulloch-Pitts neuron model,
Rosenblatt's perceptron in the 1950s, many researchers and machine learning practitioners slowly
began to lose interest in neural networks since no one had a good solution for training a neural
network with multiple layers. Eventually, interest in neural networks was rekindled in 1986 when
D.E. Rumelhart, G.E. Hinton, and R.J. Williams were involved in the (re)discovery and
popularization of the backpropagation algorithm to train neural networks more efficiently, which we
will discuss in more detail later in this chapter (Rumelhart, David E.; Hinton, Geoffrey E.; Williams,
Ronald J. (1986). Learning Representations by Back-propagating Errors. Nature 323 (6088):
533-536).


During the previous decade, many more major breakthroughs resulted in what we now call deep
learning algorithms, which can be used to create feature detectors from unlabeled data to pre-train
deep neural networks, that is, neural networks that are composed of many layers. Neural networks are
a hot topic not only in academic research, but also in big technology companies such as Facebook,
Microsoft, and Google who invest heavily in artificial neural networks and deep learning research.
As of today, complex neural networks powered by deep learning algorithms are considered
state-of-the-art when it comes to complex problem solving such as image and voice recognition.
Popular examples of the products in our everyday life that are powered by deep learning are Google's
image search and Google Translate, an application for smartphones that can automatically recognize
text in images for real-time translation into 20 languages
(http://googleresearch.blogspot.com/2015/07/how-google-translate-squeezes-deep.html).


Many more exciting applications of deep neural networks are under active development at major tech
companies, for example, Facebook's DeepFace for tagging images (Y. Taigman, M. Yang, M. Ranzato,
and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In
Computer Vision and Pattern Recognition CVPR, 2014 IEEE Conference, pages 1701-1708) and
Baidu's DeepSpeech, which is able to handle voice queries in Mandarin (A. Hannun, C. Case, J.
Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al.
DeepSpeech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014). In
addition, the pharmaceutical industry recently started to use deep learning techniques for drug
discovery and toxicity prediction, and research has shown that these novel techniques substantially
exceed the performance of traditional methods for virtual screening (T. Unterthiner, A. Mayr, G.
Klambauer, and S. Hochreiter. Toxicity prediction using deep learning. arXiv preprint
arXiv:1503.01445, 2015).
Single-layer neural network recap


This chapter is all about multi-layer neural networks, how they work, and how to train them to solve
complex problems. However, before we dig deeper into a particular multi-layer neural network
architecture, let's briefly reiterate some of the concepts of single-layer neural networks that we
introduced in Chapter 2, Training Machine Learning Algorithms for Classification, namely, the
ADAptive LInear NEuron (Adaline) algorithm that is shown in the following figure:

[Figure: the Adaline algorithm, showing the input values, the weight coefficients, the activation
function, the unit step function, and the predicted class label]





In Chapter 2, Training Machine Learning Algorithms for Classification, we implemented the
Adaline algorithm to perform binary classification, and we used a gradient descent optimization
algorithm to learn the weight coefficients of the model. In every epoch (pass over the training set), we
updated the weight vector $\mathbf{w}$ using the following update rule:

$$\mathbf{w} := \mathbf{w} + \Delta\mathbf{w}, \quad \text{where } \Delta\mathbf{w} = -\eta \nabla J(\mathbf{w})$$

In other words, we computed the gradient based on the whole training set and updated the weights of
the model by taking a step into the opposite direction of the gradient $\nabla J(\mathbf{w})$. In order to find the
optimal weights of the model, we optimized an objective function that we defined as the Sum of
Squared Errors (SSE) cost function $J(\mathbf{w})$. Furthermore, we multiplied the gradient by a factor, the
learning rate $\eta$, which we chose carefully to balance the speed of learning against the risk of
overshooting the global minimum of the cost function.


In gradient descent optimization, we updated all weights simultaneously after each epoch, and we
defined the partial derivative for each weight $w_j$ in the weight vector $\mathbf{w}$ as follows:

$$\frac{\partial}{\partial w_j} J(\mathbf{w}) = -\sum_i \left(y^{(i)} - a^{(i)}\right) x_j^{(i)}$$

Here $y^{(i)}$ is the target class label of a particular sample $\mathbf{x}^{(i)}$, and $a^{(i)}$ is the activation of the neuron,
which is a linear function in the special case of Adaline. Furthermore, we defined the activation
function $\phi(\cdot)$ as follows:

$$\phi(z) = z = a$$

Here, the net input $z$ is a linear combination of the weights that are connecting the input to the output
layer:

$$z = \sum_j w_j x_j = \mathbf{w}^T \mathbf{x}$$

While we used the activation $\phi(z)$ to compute the gradient update, we implemented a threshold
function (Heaviside function) to squash the continuous-valued output into binary class labels for
prediction:

$$\hat{y} = \begin{cases} 1 & \text{if } \phi(z) \ge 0 \\ -1 & \text{otherwise} \end{cases}$$
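As a refresher, the pieces of this recap (net input, linear activation, gradient, weight update, and threshold function) can be put together in a short NumPy sketch. This is not the Adaline class from Chapter 2 but a condensed illustration; the function names, the toy data, and the learning rate are hypothetical, and we assume the SSE cost with a factor of 1/2 so that the gradient matches the partial derivative given above:

```python
import numpy as np

def net_input(X, w):
    """z = w_0 + w_1*x_1 + ... + w_m*x_m, with w[0] as the bias weight."""
    return X.dot(w[1:]) + w[0]

def sse_cost(X, y, w):
    """Sum of squared errors J(w) = 1/2 * sum_i (y_i - a_i)^2."""
    return 0.5 * ((y - net_input(X, w)) ** 2).sum()

def gradient_step(X, y, w, eta):
    """One batch update: w := w + delta_w with delta_w = -eta * grad J."""
    errors = y - net_input(X, w)        # activation is linear: a = z
    grad = np.empty_like(w)
    grad[1:] = -X.T.dot(errors)         # dJ/dw_j = -sum_i (y_i - a_i) x_ij
    grad[0] = -errors.sum()             # bias weight, x_0 = 1
    return w - eta * grad

def predict(X, w):
    """Threshold the continuous activation into class labels 1 / -1."""
    return np.where(net_input(X, w) >= 0.0, 1, -1)

# Hypothetical toy data: two linearly separable groups in 2D.
X_toy = np.array([[-2.0, -1.0], [-1.0, -2.0], [1.0, 2.0], [2.0, 1.0]])
y_toy = np.array([-1, -1, 1, 1])

w = np.zeros(1 + X_toy.shape[1])
costs = []
for epoch in range(20):
    costs.append(sse_cost(X_toy, y_toy, w))
    w = gradient_step(X_toy, y_toy, w, eta=0.01)

print(costs[0], costs[-1])
print(predict(X_toy, w))
```

Note that the gradient is computed from the continuous activation, while the threshold function is only applied when we want class labels.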


Note 


Note that although Adaline consists of two layers, one input layer and one output layer, it is called a 
single-layer network because of its single link between the input and output layers. 
Introducing the multi-layer neural network architecture


In this section, we will see how to connect multiple single neurons to a multi-layer feedforward 
neural network; this special type of network is also called a multi-layer perceptron (MLP). The 
following figure explains the concept of an MLP consisting of three layers: one input layer, one 
hidden layer, and one output layer. The units in the hidden layer are fully connected to the input layer, 
and the output layer is fully connected to the hidden layer, respectively. If such a network has more 
than one hidden layer, we also call it a deep artificial neural network. 


[Figure: an MLP with three layers: 1st layer (input layer), 2nd layer (hidden layer), and 3rd layer
(output layer)]





Note 


We could add an arbitrary number of hidden layers to the MLP to create deeper network architectures.
Practically, we can think of the number of layers and units in a neural network as additional
hyperparameters that we want to optimize for a given problem task using the cross-validation that
we discussed in Chapter 6, Learning Best Practices for Model Evaluation and Hyperparameter
Tuning.


However, the error gradients that we will calculate later via backpropagation would become 
increasingly small as more layers are added to a network. This vanishing gradient problem makes 
the model learning more challenging. Therefore, special algorithms have been developed to pretrain 
such deep neural network structures, which is called deep learning. 


As shown in the preceding figure, we denote the $i$th activation unit in the $l$th layer as $a_i^{(l)}$, and the
activation units $a_0^{(1)}$ and $a_0^{(2)}$ are the bias units, respectively, which we set equal to 1. The activation
of the units in the input layer is just its input plus the bias unit:

$$\mathbf{a}^{(1)} = \begin{bmatrix} a_0^{(1)} \\ a_1^{(1)} \\ \vdots \\ a_m^{(1)} \end{bmatrix} = \begin{bmatrix} 1 \\ x_1^{(i)} \\ \vdots \\ x_m^{(i)} \end{bmatrix}$$

Each unit in layer $l$ is connected to all units in layer $l+1$ via a weight coefficient. For example, the
connection between the $k$th unit in layer $l$ to the $j$th unit in layer $l+1$ would be written as $w_{j,k}^{(l)}$.
Please note that the superscript $i$ in $x_m^{(i)}$ stands for the $i$th sample, not the $i$th layer. In the following
paragraphs, we will often omit the superscript $i$ for clarity.


While one unit in the output layer would suffice for a binary classification task, we saw a more
general form of a neural network in the preceding figure, which allows us to perform multi-class
classification via a generalization of the One-vs-All (OvA) technique. To better understand how this
works, remember the one-hot representation of categorical variables that we introduced in Chapter 4,
Building Good Training Sets - Data Preprocessing. For example, we would encode the three class
labels in the familiar Iris dataset (0=Setosa, 1=Versicolor, 2=Virginica) as follows:

$$0 = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad 1 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \quad 2 = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$$

This one-hot vector representation allows us to tackle classification tasks with an arbitrary number of
unique class labels present in the training set.
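In code, such an encoding takes only a few lines. The helper below, encode_one_hot, is a hypothetical name for this sketch, and storing one column per sample (rather than one row per sample) is a layout choice we assume here so that each column corresponds to one of the vectors shown above:

```python
import numpy as np

def encode_one_hot(y, n_classes):
    """Return an (n_classes, n_samples) matrix with one column per sample."""
    onehot = np.zeros((n_classes, y.shape[0]))
    for idx, label in enumerate(y):
        onehot[label, idx] = 1.0
    return onehot

# Iris-style integer labels: 0=Setosa, 1=Versicolor, 2=Virginica.
y = np.array([0, 1, 2, 1])
Y_onehot = encode_one_hot(y, n_classes=3)
print(Y_onehot)

# Taking the row index of the largest entry recovers the labels.
y_back = np.argmax(Y_onehot, axis=0)
```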


If you are new to neural network representations, the terminology around the indices (subscripts and
superscripts) may look a little bit confusing at first. You may wonder why we wrote $w_{j,k}^{(l)}$ and not
$w_{k,j}^{(l)}$ to refer to the weight coefficient that connects the $k$th unit in layer $l$ to the $j$th unit in layer
$l+1$. What may seem a little bit quirky at first will make much more sense in later sections when we
vectorize the neural network representation. For example, we will summarize the weights that connect
the input and hidden layer by a matrix $\mathbf{W}^{(1)} \in \mathbb{R}^{h \times [m+1]}$, where $h$ is the number of hidden units and
$m+1$ is the number of input units plus bias unit. Since it is important to internalize this notation to
follow the concepts later in this chapter, let's summarize what we just discussed in a descriptive
illustration of a simplified 3-4-3 multi-layer perceptron:

[Figure: a simplified multi-layer perceptron. Layer l=1 has 3 input units (m=3, not counting the bias
unit), layer l=2 has 3 hidden units (h=3, not counting the bias unit), and layer l=3 has 3 output units
(t=3). Number of layers: L=3]
Activating a neural network via forward propagation


In this section, we will describe the process of forward propagation to calculate the output of an 
MLP model. To understand how it fits into the context of learning an MLP model, let's summarize the 
MLP learning procedure in three simple steps: 


1. Starting at the input layer, we forward propagate the patterns of the training data through the 
network to generate an output. 

2. Based on the network's output, we calculate the error that we want to minimize using a cost 
function that we will describe later. 

3. We backpropagate the error, find its derivative with respect to each weight in the network, and 
update the model. 


Finally, after repeating the steps for multiple epochs and learning the weights of the MLP, we use 
forward propagation to calculate the network output and apply a threshold function to obtain the 
predicted class labels in the one-hot representation, which we described in the previous section. 


Now, let's walk through the individual steps of forward propagation to generate an output from the
patterns in the training data. Since each unit in the hidden layer is connected to all units in the input
layer, we first calculate the activation $a_1^{(2)}$ as follows:

$$z_1^{(2)} = a_0^{(1)} w_{1,0}^{(1)} + a_1^{(1)} w_{1,1}^{(1)} + \dots + a_m^{(1)} w_{1,m}^{(1)}$$

$$a_1^{(2)} = \phi\left(z_1^{(2)}\right)$$

Here, $z_1^{(2)}$ is the net input and $\phi(\cdot)$ is the activation function, which has to be differentiable to learn
the weights that connect the neurons using a gradient-based approach. To be able to solve complex
problems such as image classification, we need nonlinear activation functions in our MLP model, for
example, the sigmoid (logistic) activation function that we used in logistic regression in Chapter 3, A
Tour of Machine Learning Classifiers Using Scikit-learn:

$$\phi(z) = \frac{1}{1 + e^{-z}}$$

As we can remember, the sigmoid function is an S-shaped curve that maps the net input $z$ onto a
logistic distribution in the range 0 to 1, which cuts the y-axis at $\phi(z) = 0.5$ where $z = 0$, as shown in
the following graph:

[Figure: the sigmoid function $\phi(z)$ plotted against the net input $z$]
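To connect the notation with code, here is a small sketch that computes the net input and the activation of the first hidden unit for a single sample. The feature values and weights are made up for illustration, and the variable names are our own:

```python
import numpy as np

def sigmoid(z):
    """Logistic function phi(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sample with m=3 features; a_0 = 1 is the bias unit.
x = np.array([0.5, -1.2, 2.0])
a1 = np.concatenate(([1.0], x))          # a^(1), length m+1

# Hypothetical weights w_{1,0}, ..., w_{1,m} into the first hidden unit.
w_1 = np.array([0.1, 0.4, -0.3, 0.2])

z_1 = np.dot(a1, w_1)                    # net input z_1^(2)
a_1 = sigmoid(z_1)                       # activation a_1^(2)
print(z_1, a_1)
print(sigmoid(0.0))                      # 0.5
```

The last line confirms that the sigmoid function returns 0.5 for a net input of zero.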





The MLP is a typical example of a feedforward artificial neural network. The term feedforward
refers to the fact that each layer serves as the input to the next layer without loops, in contrast to
recurrent neural networks, an architecture that we will discuss later in this chapter. The term
multi-layer perceptron may sound a little bit confusing, since the artificial neurons in this network
architecture are typically sigmoid units, not perceptrons. Intuitively, we can think of the neurons in the
MLP as logistic regression units that return values in the continuous range between 0 and 1.


For purposes of code efficiency and readability, we will now write the activation in a more compact 
form using the concepts of basic linear algebra, which will allow us to vectorize our code 
implementation via NumPy rather than writing multiple nested and expensive Python for loops: 


$$\mathbf{z}^{(2)} = \mathbf{W}^{(1)} \mathbf{a}^{(1)}$$

$$\mathbf{a}^{(2)} = \phi\left(\mathbf{z}^{(2)}\right)$$


Here, $\mathbf{a}^{(1)}$ is our $[m+1] \times 1$ dimensional feature vector of a sample
$\mathbf{x}^{(i)}$ plus bias unit. $\mathbf{W}^{(1)}$ is an $h \times [m+1]$ dimensional weight
matrix where $h$ is the number of hidden units in our neural network. After matrix-vector
multiplication, we obtain the $h \times 1$ dimensional net input vector $\mathbf{z}^{(2)}$ to
calculate the activation $\mathbf{a}^{(2)}$ (where $\mathbf{a}^{(2)} \in \mathbb{R}^{h \times 1}$).
Furthermore, we can generalize this computation to all $n$ samples in the training set:


$$\mathbf{Z}^{(2)} = \mathbf{W}^{(1)} \left[\mathbf{A}^{(1)}\right]^T$$


Here, $\mathbf{A}^{(1)}$ is now an $n \times [m+1]$ matrix, and the matrix-matrix multiplication
will result in an $h \times n$ dimensional net input matrix $\mathbf{Z}^{(2)}$. Finally, we apply
the activation function $\phi(\cdot)$ to each value in the net input matrix to get the $h \times n$
activation matrix $\mathbf{A}^{(2)}$ for the next layer (here, output layer):


$$\mathbf{A}^{(2)} = \phi\left(\mathbf{Z}^{(2)}\right)$$


Similarly, we can rewrite the activation of the output layer in the vectorized form: 


$$\mathbf{Z}^{(3)} = \mathbf{W}^{(2)} \mathbf{A}^{(2)}$$


Here, we multiply the $t \times h$ matrix $\mathbf{W}^{(2)}$ ($t$ is the number of output units)
by the $h \times n$ dimensional matrix $\mathbf{A}^{(2)}$ to obtain the $t \times n$ dimensional
matrix $\mathbf{Z}^{(3)}$ (the columns in this matrix represent the outputs for each sample).


Lastly, we apply the sigmoid activation function to obtain the continuous valued output of our 
network: 


$$\mathbf{A}^{(3)} = \phi\left(\mathbf{Z}^{(3)}\right), \quad \mathbf{A}^{(3)} \in \mathbb{R}^{t \times n}$$
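

The vectorized forward pass described above can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions, not the implementation used later in this chapter: the layer sizes, the random inputs, and the helper name sigmoid are made up for the example, and W2 is given one extra column for the bias unit of the hidden layer, which the dimensions quoted in the text leave out.

```python
import numpy as np


def sigmoid(z):
    # Logistic activation, applied element-wise
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical sizes: n samples, m features, h hidden units, t output units
rng = np.random.default_rng(0)
n, m, h, t = 5, 4, 3, 2

X = rng.normal(size=(n, m))
W1 = rng.uniform(-1.0, 1.0, size=(h, m + 1))   # h x [m+1]
W2 = rng.uniform(-1.0, 1.0, size=(t, h + 1))   # t x [h+1] (assumed bias column)

A1 = np.hstack([np.ones((n, 1)), X])           # n x [m+1], bias unit in column 0
Z2 = W1 @ A1.T                                 # h x n net input of the hidden layer
A2 = sigmoid(Z2)                               # h x n activation of the hidden layer
A2 = np.vstack([np.ones((1, n)), A2])          # [h+1] x n after adding the bias row
Z3 = W2 @ A2                                   # t x n net input of the output layer
A3 = sigmoid(Z3)                               # t x n, one column per sample
```

Each column of A3 holds the output activations for one sample, which matches the column-wise layout described in the text.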
Classifying handwritten digits


In the previous section, we covered a lot of the theory around neural networks, which can be a little 
bit overwhelming if you are new to this topic. Before we continue with the discussion of the 
algorithm for learning the weights of the MLP model, backpropagation, let's take a short break from 
the theory and see a neural network in action. 


Note 


Neural network theory can be quite complex, thus I want to recommend two additional resources that
cover some of the concepts that we discuss in this chapter in more detail:


T. Hastie, J. Friedman, and R. Tibshirani. The Elements of Statistical Learning, Volume 2. Springer,
2009.


C. M. Bishop et al. Pattern Recognition and Machine Learning, Volume 1. Springer New York,
2006.


In this section, we will train our first multi-layer neural network to classify handwritten digits from
the popular MNIST dataset (short for Modified National Institute of Standards and Technology
database) that has been constructed by Yann LeCun et al. and serves as a popular benchmark dataset
for machine learning algorithms (Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based
Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278-2324,
November 1998).
Obtaining the MNIST dataset


The MNIST dataset is publicly available at http://yann.lecun.com/exdb/mnist/ and consists of the
following four parts:


• Training set images: train-images-idx3-ubyte.gz (9.9 MB, 47 MB unzipped, and 60,000 samples)
• Training set labels: train-labels-idx1-ubyte.gz (29 KB, 60 KB unzipped, and 60,000 labels)
• Test set images: t10k-images-idx3-ubyte.gz (1.6 MB, 7.8 MB unzipped, and 10,000 samples)
• Test set labels: t10k-labels-idx1-ubyte.gz (5 KB, 10 KB unzipped, and 10,000 labels)


The MNIST dataset was constructed from two datasets of the US National Institute of Standards 
and Technology (NIST). The training set consists of handwritten digits from 250 different people, 50 
percent high school students, and 50 percent employees from the Census Bureau. Note that the test set 
contains handwritten digits from different people following the same split. 


After downloading the files, I recommend unzipping them with the Unix/Linux gzip tool from the
command line terminal for efficiency, using the following command in your local MNIST download
directory:


gzip *ubyte.gz -d 


Alternatively, you could use your favorite unzipping tool if you are working with a machine running 
on Microsoft Windows. The images are stored in byte format, and we will read them into NumPy 
arrays that we will use to train and test our MLP implementation: 


import os
import struct
import numpy as np


def load_mnist(path, kind='train'):
    """Load MNIST data from `path`"""
    labels_path = os.path.join(path,
                               '%s-labels-idx1-ubyte'
                               % kind)
    images_path = os.path.join(path,
                               '%s-images-idx3-ubyte'
                               % kind)

    # Label file: 8-byte header (magic number, item count), then one byte per label
    with open(labels_path, 'rb') as lbpath:
        magic, n = struct.unpack('>II',
                                 lbpath.read(8))
        labels = np.fromfile(lbpath,
                             dtype=np.uint8)

    # Image file: 16-byte header, then 28 x 28 = 784 bytes per image
    with open(images_path, 'rb') as imgpath:
        magic, num, rows, cols = struct.unpack('>IIII',
                                               imgpath.read(16))
        images = np.fromfile(imgpath,
                             dtype=np.uint8).reshape(len(labels), 784)

    return images, labels


The load_mnist function returns two arrays, the first being an n × m dimensional NumPy array
(images), where n is the number of samples and m is the number of features. The training dataset
consists of 60,000 training digits and the test set contains 10,000 samples, respectively. The images in
the MNIST dataset consist of 28 × 28 pixels, and each pixel is represented by a gray scale intensity
value. Here, we unroll the 28 × 28 pixels into 1D row vectors, which represent the rows in our image
array (784 per row or image). The second array (labels) returned by the load_mnist function
contains the corresponding target variable, the class labels (integers 0-9) of the handwritten digits.


The way we read in the image might seem a little bit strange at first: 


magic, n = struct.unpack('>II', lbpath.read(8))
labels = np.fromfile(lbpath, dtype=np.uint8)


To understand how these two lines of code work, let's take a look at the dataset description from the 
MNIST website: 


[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000801(2049) magic number (MSB first)
0004     32 bit integer  60000            number of items
0008     unsigned byte   ??               label
0009     unsigned byte   ??               label
........
xxxx     unsigned byte   ??               label


Using the two lines of the preceding code, we first read in the magic number, which is a description
of the file protocol, as well as the number of items (n) from the file buffer before we read the
following bytes into a NumPy array using the fromfile method. The fmt parameter value >II that
we passed as an argument to struct.unpack has two parts:


• >: This is the big-endian byte order (it defines the order in which a sequence of bytes is stored);
  if you are unfamiliar with the terms big-endian and little-endian, you can find an excellent article
  about Endianness on Wikipedia (https://en.wikipedia.org/wiki/Endianness).
• I: This is an unsigned integer.
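

To make the role of the byte-order character more concrete, the following minimal sketch builds an 8-byte header in memory and unpacks it twice. It does not read the real MNIST files; the values 2049 and 60000 are taken from the dataset description above, and the variable names are chosen only for this illustration.

```python
import struct

# Build an 8-byte header like the one at the start of the label file:
# two big-endian unsigned 32-bit integers (magic number, number of items)
header = struct.pack('>II', 2049, 60000)

# Reading it back with the matching big-endian format recovers both values
magic, n = struct.unpack('>II', header)

# Reading the same bytes as little-endian ('<') scrambles the byte order,
# so the magic number no longer matches
wrong_magic, wrong_n = struct.unpack('<II', header)

print(len(header), magic, n, wrong_magic)
```

The mismatch in the last value is why the format string has to state the byte order explicitly instead of relying on the native order of the machine.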
By executing the following code, we will now load the 60,000 training instances as well as the
10,000 test samples from the mnist directory where we unzipped the MNIST dataset:


>>> X_train, y_train = load_mnist('mnist', kind='train')
>>> print('Rows: %d, columns: %d'
...       % (X_train.shape[0], X_train.shape[1]))
Rows: 60000, columns: 784


>>> X_test, y_test = load_mnist('mnist', kind='t10k')
>>> print('Rows: %d, columns: %d'
...       % (X_test.shape[0], X_test.shape[1]))
Rows: 10000, columns: 784


To get an idea of what the images in MNIST look like, let's visualize examples of the digits 0-9 after
reshaping the 784-pixel vectors from our feature matrix into the original 28 × 28 image that we can
plot via matplotlib's imshow function:


>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots(nrows=2, ncols=5, sharex=True, sharey=True,)
>>> ax = ax.flatten()
>>> for i in range(10):
...     img = X_train[y_train == i][0].reshape(28, 28)
...     ax[i].imshow(img, cmap='Greys', interpolation='nearest')
>>> ax[0].set_xticks([])
>>> ax[0].set_yticks([])
>>> plt.tight_layout()
>>> plt.show()


We should now see a plot of the 2x5 subfigures showing a representative image of each unique 
digit: 
In addition, let's also plot multiple examples of the same digit to see how different those handwriting
examples really are:


>>> fig, ax = plt.subplots(nrows=5,
...                        ncols=5,
...                        sharex=True,
...                        sharey=True,)
>>> ax = ax.flatten()
>>> for i in range(25):
...     img = X_train[y_train == 7][i].reshape(28, 28)
...     ax[i].imshow(img, cmap='Greys', interpolation='nearest')
>>> ax[0].set_xticks([])
>>> ax[0].set_yticks([])
>>> plt.tight_layout()
>>> plt.show()


After executing the code, we should now see the first 25 variants of the digit 7. 





Optionally, we can save the MNIST image data and labels as CSV files to open them in programs that
do not support their special byte format. However, we should be aware that the CSV file format will
take up substantially more space on our local drive, as listed here:


• train_img.csv: 109.5 MB
• train_labels.csv: 120 KB
• test_img.csv: 18.3 MB
If we decide to save those CSV files, we can execute the following code in our Python session after
loading the MNIST data into NumPy arrays:


>>> np.savetxt('train_img.csv', X_train,
...            fmt='%i', delimiter=',')
>>> np.savetxt('train_labels.csv', y_train,
...            fmt='%i', delimiter=',')
>>> np.savetxt('test_img.csv', X_test,
...            fmt='%i', delimiter=',')
>>> np.savetxt('test_labels.csv', y_test,
...            fmt='%i', delimiter=',')


Once we have saved the CSV files, we can load them back into Python using NumPy's genfromtxt 
function: 


>>> X_train = np.genfromtxt('train_img.csv',
...                         dtype=int, delimiter=',')
>>> y_train = np.genfromtxt('train_labels.csv',
...                         dtype=int, delimiter=',')
>>> X_test = np.genfromtxt('test_img.csv',
...                        dtype=int, delimiter=',')
>>> y_test = np.genfromtxt('test_labels.csv',
...                        dtype=int, delimiter=',')


However, it will take substantially longer to load the MNIST data from the CSV files, thus I 
recommend you stick to the original byte format if possible. 
Implementing a multi-layer perceptron


In this subsection, we will now implement the code of an MLP with one input, one hidden, and one 
output layer to classify the images in the MNIST dataset. I have tried to keep the code as simple as 
possible. However, it may seem a little bit complicated at first, and I encourage you to download the 
sample code for this chapter from the Packt Publishing website, where you can find this MLP 
implementation annotated with comments and syntax highlighting for better readability. If you are not 
running the code from the accompanying IPython notebook, I recommend you copy it into a Python 
script file in your current working directory, for example, neuralnet.py, which you can then import 
into your current Python session via the following command: 


from neuralnet import NeuralNetMLP 


The code will contain parts that we have not talked about yet, such as the backpropagation algorithm, 
but most of the code should look familiar to you based on the Adaline implementation in Chapter 2, 
Training Machine Learning Algorithms for Classification, and the discussion of forward 
propagation in earlier sections. Do not worry if not all of the code makes immediate sense to you; we 
will follow up on certain parts later in this chapter. However, going over the code at this stage can 
make it easier to follow the theory later. 


import numpy as np
from scipy.special import expit
import sys


class NeuralNetMLP(object):
    def __init__(self, n_output, n_features, n_hidden=30,
                 l1=0.0, l2=0.0, epochs=500, eta=0.001,
                 alpha=0.0, decrease_const=0.0, shuffle=True,
                 minibatches=1, random_state=None):
        np.random.seed(random_state)
        self.n_output = n_output
        self.n_features = n_features
        self.n_hidden = n_hidden
        self.w1, self.w2 = self._initialize_weights()
        self.l1 = l1
        self.l2 = l2
        self.epochs = epochs
        self.eta = eta
        self.alpha = alpha
        self.decrease_const = decrease_const
        self.shuffle = shuffle
        self.minibatches = minibatches

    def _encode_labels(self, y, k):
        # one-hot encoding: one row per class, one column per sample
        onehot = np.zeros((k, y.shape[0]))
        for idx, val in enumerate(y):
            onehot[val, idx] = 1.0
        return onehot

    def _initialize_weights(self):
        w1 = np.random.uniform(-1.0, 1.0,
                               size=self.n_hidden*(self.n_features + 1))
        w1 = w1.reshape(self.n_hidden, self.n_features + 1)
        w2 = np.random.uniform(-1.0, 1.0,
                               size=self.n_output*(self.n_hidden + 1))
        w2 = w2.reshape(self.n_output, self.n_hidden + 1)
        return w1, w2

    def _sigmoid(self, z):
        # expit is equivalent to 1.0/(1.0 + np.exp(-z))
        return expit(z)

    def _sigmoid_gradient(self, z):
        sg = self._sigmoid(z)
        return sg * (1 - sg)

    def _add_bias_unit(self, X, how='column'):
        if how == 'column':
            X_new = np.ones((X.shape[0], X.shape[1]+1))
            X_new[:, 1:] = X
        elif how == 'row':
            X_new = np.ones((X.shape[0]+1, X.shape[1]))
            X_new[1:, :] = X
        else:
            raise AttributeError('`how` must be `column` or `row`')
        return X_new

    def _feedforward(self, X, w1, w2):
        a1 = self._add_bias_unit(X, how='column')
        z2 = w1.dot(a1.T)
        a2 = self._sigmoid(z2)
        a2 = self._add_bias_unit(a2, how='row')
        z3 = w2.dot(a2)
        a3 = self._sigmoid(z3)
        return a1, z2, a2, z3, a3

    def _L2_reg(self, lambda_, w1, w2):
        # the first column holds the bias weights and is not regularized
        return (lambda_/2.0) * (np.sum(w1[:, 1:] ** 2)
                                + np.sum(w2[:, 1:] ** 2))

    def _L1_reg(self, lambda_, w1, w2):
        return (lambda_/2.0) * (np.abs(w1[:, 1:]).sum()
                                + np.abs(w2[:, 1:]).sum())

    def _get_cost(self, y_enc, output, w1, w2):
        term1 = -y_enc * (np.log(output))
        term2 = (1 - y_enc) * np.log(1 - output)
        cost = np.sum(term1 - term2)
        L1_term = self._L1_reg(self.l1, w1, w2)
        L2_term = self._L2_reg(self.l2, w1, w2)
        cost = cost + L1_term + L2_term
        return cost

    def _get_gradient(self, a1, a2, a3, z2, y_enc, w1, w2):
        # backpropagation
        sigma3 = a3 - y_enc
        z2 = self._add_bias_unit(z2, how='row')
        sigma2 = w2.T.dot(sigma3) * self._sigmoid_gradient(z2)
        sigma2 = sigma2[1:, :]
        grad1 = sigma2.dot(a1)
        grad2 = sigma3.dot(a2.T)

        # regularize
        grad1[:, 1:] += (w1[:, 1:] * (self.l1 + self.l2))
        grad2[:, 1:] += (w2[:, 1:] * (self.l1 + self.l2))

        return grad1, grad2

    def predict(self, X):
        a1, z2, a2, z3, a3 = self._feedforward(X, self.w1, self.w2)
        y_pred = np.argmax(z3, axis=0)
        return y_pred

    def fit(self, X, y, print_progress=False):
        self.cost_ = []
        X_data, y_data = X.copy(), y.copy()
        y_enc = self._encode_labels(y, self.n_output)

        delta_w1_prev = np.zeros(self.w1.shape)
        delta_w2_prev = np.zeros(self.w2.shape)

        for i in range(self.epochs):

            # adaptive learning rate
            self.eta /= (1 + self.decrease_const*i)

            if print_progress:
                sys.stderr.write(
                    '\rEpoch: %d/%d' % (i+1, self.epochs))
                sys.stderr.flush()

            if self.shuffle:
                # shuffle the samples and their one-hot targets together so
                # that the mini-batches below are drawn from the shuffled data
                idx = np.random.permutation(y_data.shape[0])
                X_data, y_enc = X_data[idx], y_enc[:, idx]

            mini = np.array_split(range(
                y_data.shape[0]), self.minibatches)

            for idx in mini:

                # feedforward
                a1, z2, a2, z3, a3 = self._feedforward(
                    X_data[idx], self.w1, self.w2)
                cost = self._get_cost(y_enc=y_enc[:, idx],
                                      output=a3,
                                      w1=self.w1,
                                      w2=self.w2)
                self.cost_.append(cost)

                # compute gradient via backpropagation
                grad1, grad2 = self._get_gradient(a1=a1, a2=a2,
                                                  a3=a3, z2=z2,
                                                  y_enc=y_enc[:, idx],
                                                  w1=self.w1,
                                                  w2=self.w2)

                # update weights
                delta_w1, delta_w2 = self.eta * grad1,\
                                     self.eta * grad2
                self.w1 -= (delta_w1 + (self.alpha * delta_w1_prev))
                self.w2 -= (delta_w2 + (self.alpha * delta_w2_prev))
                delta_w1_prev, delta_w2_prev = delta_w1, delta_w2

        return self


Now, let's initialize a new 784-50-10 MLP, a neural network with 784 input units (n_features), 50
hidden units (n_hidden), and 10 output units (n_output):


>>> nn = NeuralNetMLP(n_output=10,
...                   n_features=X_train.shape[1],
...                   n_hidden=50,
...                   l2=0.1,
...                   l1=0.0,
...                   epochs=1000,
...                   eta=0.001,
...                   alpha=0.001,
...                   decrease_const=0.00001,
...                   shuffle=True,
...                   minibatches=50,
...                   random_state=1)


As you may have noticed, by going over our preceding MLP implementation, we also implemented 
some additional features, which are summarized here: 


• l2: The $\lambda$ parameter for L2 regularization to decrease the degree of overfitting; similarly,
  l1 is the $\lambda$ parameter for L1 regularization.
• epochs: The number of passes over the training set.
• eta: The learning rate $\eta$.
• alpha: A parameter for momentum learning to add a factor of the previous gradient to the weight
  update for faster learning: $\Delta \mathbf{w}_t = \eta \nabla J(\mathbf{w}_t) + \alpha \Delta \mathbf{w}_{t-1}$
  (where $t$ is the current time step or epoch).
• decrease_const: The decrease constant $d$ for an adaptive learning rate $\eta$ that decreases over
  time for better convergence: $\eta / (1 + t \times d)$.
• shuffle: Shuffling the training set prior to every epoch to prevent the algorithm from getting
  stuck in cycles.
• minibatches: Splitting of the training data into k mini-batches in each epoch. The gradient is
  computed for each mini-batch separately instead of the entire training data for faster learning.
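

The momentum update and the decaying learning rate from the list above can be isolated in two small helper functions. This is a hedged sketch rather than part of the MLP class: the names momentum_step and decayed_eta are hypothetical, and the toy cost function J(w) = sum(w ** 2) with gradient 2 * w is an assumption made purely for the demonstration.

```python
import numpy as np


def momentum_step(w, grad, delta_prev, eta, alpha):
    # Plain gradient step plus a fraction alpha of the previous step
    delta = eta * grad
    w_new = w - (delta + alpha * delta_prev)
    return w_new, delta


def decayed_eta(eta, decrease_const, t):
    # Learning rate eta / (1 + t * d) for time step t and decrease constant d
    return eta / (1.0 + decrease_const * t)


# Toy problem (assumption): minimize J(w) = sum(w ** 2), whose gradient is 2 * w
w = np.array([1.0, -2.0])
delta_prev = np.zeros_like(w)
for t in range(50):
    eta_t = decayed_eta(0.1, 1e-3, t)
    w, delta_prev = momentum_step(w, 2.0 * w, delta_prev, eta_t, alpha=0.5)

final_cost = float(np.sum(w ** 2))
```

Note that decayed_eta always starts from the initial learning rate, whereas the fit method in the listing divides self.eta in place in every epoch, so its decay compounds over time.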


Next, we train the MLP using 60,000 samples from the already shuffled MNIST training dataset. 
Before you execute the following code, please note that training the neural network may take 10-30 
minutes on standard desktop computer hardware: 


>>> nn.fit(X_train, y_train, print_progress=True)
Epoch: 1000/1000


Similar to our previous Adaline implementation, we save the cost for each epoch in a cost_ list that
we can now visualize, making sure that the optimization algorithm reached convergence. Here, we
only plot every 50th step to account for the 50 mini-batches (50 mini-batches × 1000 epochs). The
code is as follows:


>>> plt.plot(range(len(nn.cost_)), nn.cost_)
>>> plt.ylim([0, 2000])
>>> plt.ylabel('Cost')
>>> plt.xlabel('Epochs * 50')
>>> plt.tight_layout()
>>> plt.show()


As we see in the following plot, the graph of the cost function looks very noisy. This is due to the fact 
that we trained our neural network with mini-batch learning, a variant of stochastic gradient descent. 
Although we can already see in the plot that the optimization algorithm converged after approximately
800 epochs (40,000/50 = 800), let's plot a smoother version of the cost function against the number of
epochs by averaging over the mini-batch intervals. The code is as follows:


>>> batches = np.array_split(range(len(nn.cost_)), 1000)
>>> cost_ary = np.array(nn.cost_)
>>> cost_avgs = [np.mean(cost_ary[i]) for i in batches]
>>> plt.plot(range(len(cost_avgs)),
...          cost_avgs,
...          color='red')
>>> plt.ylim([0, 2000])
>>> plt.ylabel('Cost')
>>> plt.xlabel('Epochs')
>>> plt.tight_layout()
>>> plt.show()


The following plot gives us a clearer picture indicating that the training algorithm converged shortly 
after the 800th epoch: 
Now, let's evaluate the performance of the model by calculating the prediction accuracy:


>>> y_train_pred = nn.predict(X_train)
>>> acc = np.sum(y_train == y_train_pred, axis=0) / X_train.shape[0]
>>> print('Training accuracy: %.2f%%' % (acc * 100))
Training accuracy: 97.74%


As we can see, the model classifies most of the training digits correctly, but how does it generalize to 
data that it has not seen before? Let's calculate the accuracy on 10,000 images in the test dataset: 


>>> y_test_pred = nn.predict(X_test)
>>> acc = np.sum(y_test == y_test_pred, axis=0) / X_test.shape[0]
>>> print('Test accuracy: %.2f%%' % (acc * 100))
Test accuracy: 96.18%


Based on the small discrepancy between training and test accuracy, we can conclude that the model 
only slightly overfits the training data. To further fine-tune the model, we could change the number of 
hidden units, values of the regularization parameters, learning rate, values of the decrease constant, or 
the adaptive learning using the techniques that we discussed in Chapter 6, Learning Best Practices 
for Model Evaluation and Hyperparameter Tuning (this is left as an exercise for the reader). 


Now, let's take a look at some of the images that our MLP struggles with:


>>> miscl_img = X_test[y_test != y_test_pred][:25]
>>> correct_lab = y_test[y_test != y_test_pred][:25]
>>> miscl_lab = y_test_pred[y_test != y_test_pred][:25]
>>> fig, ax = plt.subplots(nrows=5,
...                        ncols=5,
...                        sharex=True,
...                        sharey=True,)
>>> ax = ax.flatten()
>>> for i in range(25):
...     img = miscl_img[i].reshape(28, 28)
...     ax[i].imshow(img,
...                  cmap='Greys',
...                  interpolation='nearest')
...     ax[i].set_title('%d) t: %d p: %d'
...                     % (i+1, correct_lab[i], miscl_lab[i]))
>>> ax[0].set_xticks([])
>>> ax[0].set_yticks([])
>>> plt.tight_layout()
>>> plt.show()


We should now see a 5 × 5 subplot matrix where the first number in the subtitles indicates the plot
index, the second number indicates the true class label (t), and the third number stands for the
predicted class label (p).
As we can see in the preceding figure, some of those images are even challenging for us humans to
classify correctly. For example, we can see that the digit 9 is classified as a 3 or 8 if the lower part of
the digit has a hook-like curvature (subplots 3, 16, and 17).
Training an artificial neural network


Now that we have seen a neural network in action and have gained a basic understanding of how it 
works by looking over the code, let's dig a little bit deeper into some of the concepts, such as the 
logistic cost function and the backpropagation algorithm that we implemented to learn the weights. 
Computing the logistic cost function


The logistic cost function that we implemented as the _get_cost method is actually pretty simple to
follow since it is the same cost function that we described in the logistic regression section in
Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn.


$$J(\mathbf{w}) = -\sum_{i=1}^{n} \left[ y^{(i)} \log\left(a^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - a^{(i)}\right) \right]$$


Here, $a^{(i)}$ is the sigmoid activation of a unit in one of the layers for the $i$th sample, which we
compute in the forward propagation step:


$$a^{(i)} = \phi\left(z^{(i)}\right)$$


Now, let's add a regularization term, which allows us to reduce the degree of overfitting. As you 
will recall from earlier chapters, the L2 and L1 regularization terms are defined as follows 
(remember that we don't regularize the bias units): 


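$$L2 = \lambda \lVert \mathbf{w} \rVert_2^2 = \lambda \sum_{j=1}^{m} w_j^2 \quad \text{and} \quad L1 = \lambda \lVert \mathbf{w} \rVert_1 = \lambda \sum_{j=1}^{m} \lvert w_j \rvert$$

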
Although our MLP implementation supports both L1 and L2 regularization, we will now only focus on 
the L2 regularization term for simplicity. However, the same concepts apply to the L1 regularization 
term. By adding the L2 regularization term to our logistic cost function, we obtain the following 
equation: 


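$$J(\mathbf{w}) = -\left[ \sum_{i=1}^{n} y^{(i)} \log\left(a^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - a^{(i)}\right) \right] + \frac{\lambda}{2} \lVert \mathbf{w} \rVert_2^2$$

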


Since we implemented an MLP for multi-class classification, this returns an output vector of $t$
elements, which we need to compare with the $t \times 1$ dimensional target vector in the one-hot
encoding representation. For example, the activation of the third layer and the target class (here:
class 2) for a particular sample may look like this:


$$\mathbf{a}^{(3)} = \begin{bmatrix} 0.1 \\ 0.9 \\ \vdots \\ 0.3 \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}$$

Thus, we need to generalize the logistic cost function to all activation units $k$ in the output layer
of our network. So our cost function (without the regularization term) becomes:


$$J(\mathbf{W}) = -\sum_{i=1}^{n} \sum_{k=1}^{t} \left[ y_k^{(i)} \log\left(a_k^{(i)}\right) + \left(1 - y_k^{(i)}\right) \log\left(1 - a_k^{(i)}\right) \right]$$


Here, the superscript $(i)$ is the index of a particular sample in our training set.
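

As a small worked example of this multi-class cost, the following sketch one-hot encodes two labels and evaluates the double sum with NumPy. The helper names one_hot and logistic_cost and the activation values are assumptions made for illustration; the arithmetic mirrors the unregularized part of the _get_cost method.

```python
import numpy as np


def one_hot(y, k):
    # k x n matrix with a single 1.0 per column (row index = class label)
    onehot = np.zeros((k, y.shape[0]))
    onehot[y, np.arange(y.shape[0])] = 1.0
    return onehot


def logistic_cost(y_enc, output):
    # Sum over all samples (columns) and all output units (rows)
    term1 = -y_enc * np.log(output)
    term2 = (1.0 - y_enc) * np.log(1.0 - output)
    return np.sum(term1 - term2)


# Two samples, three classes; columns are samples (made-up activations)
y = np.array([1, 0])
y_enc = one_hot(y, 3)
output = np.array([[0.1, 0.8],
                   [0.9, 0.1],
                   [0.3, 0.2]])

cost = logistic_cost(y_enc, output)
print(y_enc)
print(cost)
```

The cost shrinks as the activation of the target unit approaches 1 and the activations of all other units approach 0.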


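The following generalized regularization term may look a little bit complicated at first, but here we
are just calculating the sum of all squared weights of a layer $l$ (without the bias term) that we added
to the first column (here, $u_l$ denotes the number of units in layer $l$ and $L$ the number of layers):


$$J(\mathbf{W}) = -\left[ \sum_{i=1}^{n} \sum_{k=1}^{t} y_k^{(i)} \log\left(a_k^{(i)}\right) + \left(1 - y_k^{(i)}\right) \log\left(1 - a_k^{(i)}\right) \right] + \frac{\lambda}{2} \sum_{l=1}^{L-1} \sum_{i=1}^{u_l} \sum_{j=1}^{u_{l+1}} \left( w_{j,i}^{(l)} \right)^2$$


The following expression represents the L2-penalty term:


$$\frac{\lambda}{2} \sum_{l=1}^{L-1} \sum_{i=1}^{u_l} \sum_{j=1}^{u_{l+1}} \left( w_{j,i}^{(l)} \right)^2$$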
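

In code, the triple sum reduces to squaring every non-bias weight of every layer and adding the results. The following sketch assumes, as in the MLP listing, that the bias weights sit in the first column of each weight matrix; the function name l2_penalty and the tiny matrices are hypothetical.

```python
import numpy as np


def l2_penalty(lambda_, weights):
    # Sum of squared weights over all layers, skipping the bias column 0
    return (lambda_ / 2.0) * sum(np.sum(W[:, 1:] ** 2) for W in weights)


# Two tiny weight matrices; the first column of each holds the bias weights
W1 = np.array([[10.0, 1.0, 2.0]])
W2 = np.array([[5.0, 3.0]])

penalty = l2_penalty(0.5, [W1, W2])
print(penalty)
```

The large bias weights 10.0 and 5.0 do not contribute to the penalty, which is the effect of excluding the bias units from regularization.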
Remember that our goal is to minimize the cost function $J(\mathbf{W})$. Thus, we need to calculate
the partial derivative of $J(\mathbf{W})$ with respect to each weight for every layer in the network:


$$\frac{\partial}{\partial w_{j,i}^{(l)}} J(\mathbf{W})$$


In the next section, we will talk about the backpropagation algorithm, which allows us to calculate 
these partial derivatives to minimize the cost function. 


Note that $\mathbf{W}$ consists of multiple matrices. In a multi-layer perceptron with one hidden
layer, we have the weight matrix $\mathbf{W}^{(1)}$, which connects the input to the hidden layer, and
$\mathbf{W}^{(2)}$, which connects the hidden layer to the output layer. An intuitive visualization of
the matrix $\mathbf{W}$ is provided in the following figure:
[Figure: visualization of the weight matrices; the rows correspond to output units.]





In this simplified figure, it may seem that both $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$ have the
same number of rows and columns, which is typically not the case unless we initialize an MLP with the
same number of hidden units, output units, and input features.


If this sounds confusing, stay tuned for the next section where we will discuss the dimensionality
of $\mathbf{W}^{(1)}$ and $\mathbf{W}^{(2)}$ in more detail in the context of the backpropagation
algorithm.
Training neural networks via backpropagation


In this section, we will go through the math of backpropagation to understand how you can learn the
weights in a neural network very efficiently. Depending on how comfortable you are with
mathematical representations, the following equations may seem relatively complicated at first. Many
people prefer a bottom-up approach and like to go over the equations step by step to develop an
intuition for algorithms. However, if you prefer a top-down approach and want to learn about
backpropagation without all the mathematical notations, I recommend that you read the next section,
Developing your intuition for backpropagation, first and revisit this section later.


In the previous section, we saw how to calculate the cost as the difference between the activation of
the last layer and the target class label. Now, we will see how the backpropagation algorithm works
to update the weights in our MLP model, which we implemented in the _get_gradient method. As
we recall from the beginning of this chapter, we first need to apply forward propagation in order to
obtain the activation of the output layer, which we formulated as follows:


$$\mathbf{Z}^{(2)} = \mathbf{W}^{(1)} \left[\mathbf{A}^{(1)}\right]^T \quad \text{(net input of the hidden layer)}$$

$$\mathbf{A}^{(2)} = \phi\left(\mathbf{Z}^{(2)}\right) \quad \text{(activation of the hidden layer)}$$

$$\mathbf{Z}^{(3)} = \mathbf{W}^{(2)} \mathbf{A}^{(2)} \quad \text{(net input of the output layer)}$$

$$\mathbf{A}^{(3)} = \phi\left(\mathbf{Z}^{(3)}\right) \quad \text{(activation of the output layer)}$$

Concisely, we just forward propagate the input features through the connection in the network as 
shown here: 
In backpropagation, we propagate the error from right to left. We start by calculating the error vector
of the output layer:


$$\boldsymbol{\delta}^{(3)} = \mathbf{a}^{(3)} - \mathbf{y}$$


Here, $\mathbf{y}$ is the vector of the true class labels.


Next, we calculate the error term of the hidden layer:


$$\boldsymbol{\delta}^{(2)} = \left(\mathbf{W}^{(2)}\right)^T \boldsymbol{\delta}^{(3)} * \frac{\partial \phi\left(\mathbf{z}^{(2)}\right)}{\partial \mathbf{z}^{(2)}}$$


Here, $\frac{\partial \phi(\mathbf{z}^{(2)})}{\partial \mathbf{z}^{(2)}}$ is simply the derivative of
the sigmoid activation function, which we implemented as _sigmoid_gradient:


$$\frac{\partial \phi\left(\mathbf{z}^{(2)}\right)}{\partial \mathbf{z}^{(2)}} = \left( \mathbf{a}^{(2)} * \left(1 - \mathbf{a}^{(2)}\right) \right)$$


Note that the asterisk symbol ($*$) means element-wise multiplication in this context.


Note 


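Although it is not important to follow the next equations, you may be curious as to how I obtained the
derivative of the activation function. I summarized the derivation step by step here:


$$\phi'(z) = \frac{\partial}{\partial z} \left( \frac{1}{1 + e^{-z}} \right)$$

$$= \frac{e^{-z}}{\left(1 + e^{-z}\right)^2}$$

$$= \frac{1 + e^{-z}}{\left(1 + e^{-z}\right)^2} - \left( \frac{1}{1 + e^{-z}} \right)^2$$

$$= \frac{1}{1 + e^{-z}} - \left( \frac{1}{1 + e^{-z}} \right)^2$$

$$= \phi(z) - \left(\phi(z)\right)^2$$

$$= \phi(z) \left(1 - \phi(z)\right)$$

$$= a(1 - a)$$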
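

The closed form a(1 - a) is easy to sanity-check numerically. The sketch below compares it with a central finite difference of the sigmoid function; the helper names, the grid of z values, and the step size are assumptions chosen for the check and are not part of the MLP implementation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_gradient(z):
    # Closed-form derivative: a * (1 - a) with a = sigmoid(z)
    a = sigmoid(z)
    return a * (1.0 - a)


z = np.linspace(-5.0, 5.0, 11)
eps = 1e-5

# Central finite difference as an independent estimate of the derivative
numerical = (sigmoid(z + eps) - sigmoid(z - eps)) / (2.0 * eps)
analytical = sigmoid_gradient(z)

max_abs_diff = np.max(np.abs(numerical - analytical))
print(max_abs_diff)
```

The derivative is largest at z = 0, where a = 0.5 and a(1 - a) = 0.25, and it vanishes as the sigmoid saturates toward 0 or 1.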


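To better understand how we compute the $\boldsymbol{\delta}^{(2)}$ term, let's walk through it in
more detail. In the preceding equation, we multiplied the transpose $\left(\mathbf{W}^{(2)}\right)^T$
of the $t \times h$ dimensional matrix $\mathbf{W}^{(2)}$ ($t$ is the number of output class labels and
$h$ is the number of hidden units) by $\boldsymbol{\delta}^{(3)}$. The transpose
$\left(\mathbf{W}^{(2)}\right)^T$ is an $h \times t$ dimensional matrix and $\boldsymbol{\delta}^{(3)}$
is a $t \times 1$ dimensional vector, so their product is an $h \times 1$ dimensional vector. We then
performed a pair-wise multiplication between $\left(\mathbf{W}^{(2)}\right)^T \boldsymbol{\delta}^{(3)}$
and $\left( \mathbf{a}^{(2)} * \left(1 - \mathbf{a}^{(2)}\right) \right)$, which is also an
$h \times 1$ dimensional vector. Eventually, after obtaining the $\delta$ terms, we can now write the
derivative of the cost function as follows:


$$\frac{\partial}{\partial w_{i,j}^{(l)}} J(\mathbf{W}) = a_j^{(l)} \delta_i^{(l+1)}$$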


Next, we need to accumulate the partial derivative of every $j$th node in layer $l$ and the $i$th error
of the node in layer $l+1$:


$$\Delta_{i,j}^{(l)} := \Delta_{i,j}^{(l)} + a_j^{(l)} \delta_i^{(l+1)}$$


Remember that we need to compute $\Delta_{i,j}^{(l)}$ for every sample in the training set. Thus, it is
easier to implement it as a vectorized version like in our preceding MLP code implementation:


$$\Delta^{(l)} := \Delta^{(l)} + \boldsymbol{\delta}^{(l+1)} \left(\mathbf{A}^{(l)}\right)^T$$


After we have accumulated the partial derivatives, we can add the regularization term as follows:


$$\Delta^{(l)} := \Delta^{(l)} + \lambda \mathbf{W}^{(l)} \quad \text{(except for the bias term)}$$


Lastly, after we have computed the gradients, we can now update the weights by taking a step in the
opposite direction of the gradient:


$$\mathbf{W}^{(l)} := \mathbf{W}^{(l)} - \eta \Delta^{(l)}$$
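

The equations of this section can be put together in a compact NumPy sketch for a network with one hidden layer. It is an illustration under stated assumptions and not a drop-in replacement for the _get_gradient method: regularization is left out, the data and layer sizes are random and made up, and the bias column of W2 is dropped before propagating the error instead of adding a bias row to z2 and slicing it off afterward, which gives the same result.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(X, W1, W2):
    n = X.shape[0]
    A1 = np.hstack([np.ones((n, 1)), X])              # n x [m+1]
    Z2 = W1 @ A1.T                                    # h x n
    A2 = np.vstack([np.ones((1, n)), sigmoid(Z2)])    # [h+1] x n
    A3 = sigmoid(W2 @ A2)                             # t x n
    return A1, Z2, A2, A3


def cost(Y, A3):
    # Unregularized logistic cost summed over samples and output units
    return -np.sum(Y * np.log(A3) + (1.0 - Y) * np.log(1.0 - A3))


def backprop(X, Y, W1, W2):
    A1, Z2, A2, A3 = forward(X, W1, W2)
    delta3 = A3 - Y                                   # error of the output layer
    a2 = sigmoid(Z2)
    delta2 = (W2[:, 1:].T @ delta3) * a2 * (1.0 - a2) # error of the hidden layer
    grad1 = delta2 @ A1                               # h x [m+1]
    grad2 = delta3 @ A2.T                             # t x [h+1]
    return grad1, grad2


# Hypothetical toy data: n samples, m features, h hidden units, t classes
rng = np.random.default_rng(1)
n, m, h, t = 6, 4, 3, 2
X = rng.normal(size=(n, m))
Y = np.eye(t)[:, rng.integers(0, t, size=n)]          # t x n one-hot targets
W1 = rng.uniform(-1.0, 1.0, size=(h, m + 1))
W2 = rng.uniform(-1.0, 1.0, size=(t, h + 1))

grad1, grad2 = backprop(X, Y, W1, W2)

# One gradient descent step with learning rate eta
eta = 0.01
W1_new, W2_new = W1 - eta * grad1, W2 - eta * grad2
```

Each gradient has the same shape as the weight matrix it belongs to, which is what makes the final update a simple element-wise subtraction.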


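To bring everything together, let's summarize backpropagation in the following figure:


[Figure: summary of backpropagation. The input x is propagated forward to the output, which is
compared with the target y. The error term of the output layer, $\boldsymbol{\delta}^{(3)}$, and the
error term of the hidden layer, which involves
$\frac{\partial \phi(\mathbf{z}^{(2)})}{\partial \mathbf{z}^{(2)}}$, are propagated backward to compute
the gradient $\frac{\partial}{\partial w_{i,j}^{(l)}} J(\mathbf{W}) = a_j^{(l)} \delta_i^{(l+1)}$.]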
Developing your intuition for backpropagation


Although backpropagation was rediscovered and popularized almost 30 years ago, it still remains 
one of the most widely used algorithms to train artificial neural networks very efficiently. In this 
section, we'll see a more intuitive summary and the bigger picture of how this fascinating algorithm 
works. 


In essence, backpropagation is just a very computationally efficient approach to compute the
derivatives of a complex cost function. Our goal is to use those derivatives to learn the weight
coefficients for parameterizing a multi-layer artificial neural network. The challenge in the
parameterization of neural networks is that we are typically dealing with a very large number of
weight coefficients in a high-dimensional feature space. In contrast to other cost functions that we
have seen in previous chapters, the error surface of a neural network cost function is not convex or
smooth. There are many bumps in this high-dimensional cost surface (local minima) that we have to
overcome in order to find the global minimum of the cost function.


You may recall the concept of the chain rule from your introductory calculus classes. The chain rule is
an approach to deriving a complex, nested function, for example, $f(g(x)) = y$, that is broken down
into basic components:


$$\frac{\partial y}{\partial x} = \frac{\partial f}{\partial g} \frac{\partial g}{\partial x}$$


In the context of computer algebra, a set of techniques has been developed to solve such problems
very efficiently, which is also known as automatic differentiation. If you are interested in learning
more about automatic differentiation in machine learning applications, I recommend the following
resource: A. G. Baydin and B. A. Pearlmutter. Automatic Differentiation of Algorithms for
Machine Learning. arXiv preprint arXiv:1404.7456, 2014, which is freely available on arXiv at
http://arxiv.org/pdf/1404.7456.pdf.


Automatic differentiation comes with two modes, the forward and the reverse mode, respectively.
Backpropagation is simply a special case of reverse-mode automatic differentiation. The key
point is that applying the chain rule in the forward mode can be quite expensive since we would have
to multiply large matrices for each layer (Jacobians) that we eventually multiply by a vector to obtain
the output. The trick of the reverse mode is that we start from right to left: we multiply a matrix by a
vector, which yields another vector that is multiplied by the next matrix and so on. Matrix-vector
multiplication is computationally much cheaper than matrix-matrix multiplication, which is why
backpropagation is one of the most popular algorithms used in neural network training.
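

The cost difference between the two multiplication orders can be illustrated with plain NumPy. The following sketch is not an automatic differentiation implementation: the random square matrices J1, J2, and J3 merely stand in for layer Jacobians, the size d is arbitrary, and the multiplication counts are the usual textbook estimates for dense products.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 200

# Stand-ins for three layer Jacobians and the vector they are applied to
J1, J2, J3 = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
v = rng.normal(size=d)

# Matrices first: two matrix-matrix products, then one matrix-vector product
matrices_first = (J1 @ J2 @ J3) @ v

# Vector first: three matrix-vector products, each yielding another vector
vector_first = J1 @ (J2 @ (J3 @ v))

# Rough scalar multiplication counts for the two orders
mults_matrices_first = 2 * d**3 + d**2
mults_vector_first = 3 * d**2

print(np.allclose(matrices_first, vector_first))
print(mults_matrices_first // mults_vector_first)
```

Both orders give the same vector because matrix multiplication is associative; only the amount of work differs, and the gap grows with the size of the matrices.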
Debugging neural networks with gradient checking


Implementations of artificial neural networks can be quite complex, and it is always a good idea to
manually check that we have implemented backpropagation correctly. In this section, we will talk
about a simple procedure called gradient checking, which is essentially a comparison between our
analytical gradients in the network and numerical gradients. Gradient checking is not specific to
feedforward neural networks but can be applied to any other neural network architecture that uses
gradient-based optimization. Even if you are planning to implement simpler algorithms using
gradient-based optimization, such as linear regression, logistic regression, and support vector
machines, it is generally a good idea to check that the gradients are computed correctly.


In the previous sections, we defined a cost function $J(W)$, where $W$ is the matrix of the weight
coefficients of an artificial network. Note that $W$ is, roughly speaking, a "stacked" matrix
consisting of the matrices $W^{(1)}$ and $W^{(2)}$ in a multi-layer perceptron with one hidden layer.
We defined $W^{(1)}$ as the $h \times [m + 1]$-dimensional matrix that connects the input layer to the
hidden layer, where $h$ is the number of hidden units and $m$ is the number of features (input units).
The matrix $W^{(2)}$ that connects the hidden layer to the output layer has the dimensions
$t \times h$, where $t$ is the number of output units. We then calculated the derivative of the cost
function for a weight $w_{i,j}^{(l)}$ as follows:

$$\frac{\partial}{\partial w_{i,j}^{(l)}} J(W)$$


Remember that we are updating the weights by taking a step in the direction opposite to the
gradient. In gradient checking, we compare this analytical solution to a numerically approximated
gradient:

$$\frac{\partial}{\partial w_{i,j}^{(l)}} J(W) \approx \frac{J\left(w_{i,j}^{(l)} + \epsilon\right) - J\left(w_{i,j}^{(l)}\right)}{\epsilon}$$

Here, $\epsilon$ is typically a very small number, for example 1e-5 (note that 1e-5 is just a more
convenient notation for 0.00001). Intuitively, we can think of this finite difference approximation
as the slope of the secant line connecting the points of the cost function for the two weights $w$
and $w + \epsilon$ (both are scalar values), as shown in the following figure. We are omitting the
superscripts and subscripts for simplicity.


[Figure: cost curve with the points J(w = 0.1) and J(w = 0.1 + ε) and their difference
J(w = 0.1) - J(w = 0.1 + ε)]





An even better approach that yields a more accurate approximation of the gradient is to compute the 
symmetric (or centered) difference quotient given by the two-point formula: 


$$\frac{J\left(w_{i,j}^{(l)} + \epsilon\right) - J\left(w_{i,j}^{(l)} - \epsilon\right)}{2\epsilon}$$

Typically, the approximated difference between the numerical gradient $J'_n$ and analytical gradient
$J'_a$ is then calculated as the L2 vector norm. For practical reasons, we unroll the computed gradient
matrices into flat vectors so that we can calculate the error (the difference between the gradient
vectors) more conveniently:

$$\text{error} = \left\| J'_n - J'_a \right\|_2$$
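
To see why the centered difference is preferred, here is a minimal sketch that compares both
approximations on a toy cost function. The cubic J(w) = w^3 is an assumption chosen only because its
exact derivative is known; it is not the cost function of our network.

```python
def J(w):
    # toy cost function standing in for the network cost
    return w ** 3

def dJ(w):
    # exact derivative of the toy cost function
    return 3 * w ** 2

w, eps = 0.1, 1e-5
forward = (J(w + eps) - J(w)) / eps                # one-sided difference
centered = (J(w + eps) - J(w - eps)) / (2 * eps)   # symmetric difference

err_forward = abs(forward - dJ(w))
err_centered = abs(centered - dJ(w))
print(err_forward, err_centered)
```

The error of the one-sided quotient shrinks linearly with epsilon, whereas the error of the symmetric
quotient shrinks quadratically, so the latter is orders of magnitude smaller here.
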
One problem is that the error is not scale invariant (small errors are more significant if the weight
vector norms are small too). Thus, it is recommended to calculate a normalized difference:


$$\text{relative error} = \frac{\left\| J'_n - J'_a \right\|_2}{\left\| J'_n \right\|_2 + \left\| J'_a \right\|_2}$$











Now, we want the relative error between the numerical gradient and the analytical gradient to be as
small as possible. Before we implement gradient checking, we need to discuss one more detail: what
is the acceptable error threshold to pass the gradient check? The relative error threshold for passing
the gradient check depends on the complexity of the network architecture. As a rule of thumb, the
more hidden layers we add, the larger the difference between the numerical and analytical gradient
can become, even if backpropagation is implemented correctly. Since we have implemented a relatively
simple neural network architecture in this chapter, we want to be rather strict about the threshold and
define the following rules:


• Relative error <= 1e-7 means everything is okay!
• Relative error <= 1e-4 means the condition is problematic, and we should look into it.
• Relative error > 1e-4 means there is probably something wrong in our code.
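
Before applying these rules to the neural network, the whole procedure can be tried on a simpler model.
The following sketch runs a gradient check on a plain logistic regression cost with randomly generated
data. All function names (cost, analytical_grad, numerical_grad, relative_error, verdict) are
hypothetical helpers written for this illustration and are not part of the NeuralNetMLP class.

```python
import numpy as np

def cost(w, X, y):
    # logistic (cross-entropy) cost
    p = 1.0 / (1.0 + np.exp(-X.dot(w)))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytical_grad(w, X, y):
    p = 1.0 / (1.0 + np.exp(-X.dot(w)))
    return X.T.dot(p - y)

def numerical_grad(w, X, y, epsilon=1e-5):
    # symmetric difference quotient, one weight at a time
    grad = np.zeros_like(w)
    for i in range(w.shape[0]):
        step = np.zeros_like(w)
        step[i] = epsilon
        grad[i] = (cost(w + step, X, y) - cost(w - step, X, y)) / (2 * epsilon)
    return grad

def relative_error(num_grad, grad):
    return (np.linalg.norm(num_grad - grad) /
            (np.linalg.norm(num_grad) + np.linalg.norm(grad)))

def verdict(rel_err):
    # thresholds taken from the rules above
    if rel_err <= 1e-7:
        return 'Ok'
    elif rel_err <= 1e-4:
        return 'Warning'
    return 'PROBLEM'

rng = np.random.RandomState(1)
X = rng.randn(20, 3)
y = (rng.rand(20) > 0.5).astype(float)
w = rng.randn(3) * 0.1

num = numerical_grad(w, X, y)
rel_err = relative_error(num, analytical_grad(w, X, y))
# simulate a bug: a gradient that is off by a factor of two
rel_err_buggy = relative_error(num, 2.0 * analytical_grad(w, X, y))
print(verdict(rel_err), verdict(rel_err_buggy))
```

A gradient that is wrong by a constant factor of two yields a relative error of about 1/3, far above
the 1e-4 threshold, so the check flags it immediately.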


Now that we have established these ground rules, let's implement gradient checking. To do so, we can
simply take the NeuralNetMLP class that we implemented previously and add the following method
to the class body:


    def _gradient_checking(self, X, y_enc, w1,
                           w2, epsilon, grad1, grad2):
        """ Apply gradient checking (for debugging only)

        Returns
        ---------
        relative_error : float
          Relative error between the numerically
          approximated gradients and the backpropagated gradients.

        """
        # numerical gradient for the weights of the first layer
        num_grad1 = np.zeros(np.shape(w1))
        epsilon_ary1 = np.zeros(np.shape(w1))
        for i in range(w1.shape[0]):
            for j in range(w1.shape[1]):
                epsilon_ary1[i, j] = epsilon
                a1, z2, a2, z3, a3 = self._feedforward(
                                               X,
                                               w1 - epsilon_ary1,
                                               w2)
                cost1 = self._get_cost(y_enc,
                                       a3,
                                       w1 - epsilon_ary1,
                                       w2)
                a1, z2, a2, z3, a3 = self._feedforward(
                                               X,
                                               w1 + epsilon_ary1,
                                               w2)
                cost2 = self._get_cost(y_enc,
                                       a3,
                                       w1 + epsilon_ary1,
                                       w2)
                num_grad1[i, j] = (cost2 - cost1) / (2 * epsilon)
                epsilon_ary1[i, j] = 0

        # numerical gradient for the weights of the second layer
        num_grad2 = np.zeros(np.shape(w2))
        epsilon_ary2 = np.zeros(np.shape(w2))
        for i in range(w2.shape[0]):
            for j in range(w2.shape[1]):
                epsilon_ary2[i, j] = epsilon
                a1, z2, a2, z3, a3 = self._feedforward(
                                               X,
                                               w1,
                                               w2 - epsilon_ary2)
                cost1 = self._get_cost(y_enc,
                                       a3,
                                       w1,
                                       w2 - epsilon_ary2)
                a1, z2, a2, z3, a3 = self._feedforward(
                                               X,
                                               w1,
                                               w2 + epsilon_ary2)
                cost2 = self._get_cost(y_enc,
                                       a3,
                                       w1,
                                       w2 + epsilon_ary2)
                num_grad2[i, j] = (cost2 - cost1) / (2 * epsilon)
                epsilon_ary2[i, j] = 0

        # unroll the gradient matrices into flat vectors
        num_grad = np.hstack((num_grad1.flatten(),
                              num_grad2.flatten()))
        grad = np.hstack((grad1.flatten(), grad2.flatten()))

        norm1 = np.linalg.norm(num_grad - grad)
        norm2 = np.linalg.norm(num_grad)
        norm3 = np.linalg.norm(grad)
        relative_error = norm1 / (norm2 + norm3)
        return relative_error


The gradient checking code seems rather simple, and my personal recommendation is to keep it that
way. Our goal is to double-check the gradient computation, so we want to make sure that we do not
introduce any additional mistakes in gradient checking by writing efficient but complex code. Next,
we only need to make a small modification to the fit method. In the following code, I omitted the
code at the beginning of the fit function for clarity, and the only lines that we need to add to the
method are implemented between the comments ## start gradient checking and ## end gradient checking:


class MLPGradientCheck(object):
    [...]
    def fit(self, X, y, print_progress=False):
        [...]
                # compute gradient via backpropagation
                grad1, grad2 = self._get_gradient(
                                     a1=a1,
                                     a2=a2,
                                     a3=a3,
                                     z2=z2,
                                     y_enc=y_enc[:, idx],
                                     w1=self.w1,
                                     w2=self.w2)

                ## start gradient checking

                grad_diff = self._gradient_checking(
                                     X=X[idx],
                                     y_enc=y_enc[:, idx],
                                     w1=self.w1,
                                     w2=self.w2,
                                     epsilon=1e-5,
                                     grad1=grad1,
                                     grad2=grad2)
                if grad_diff <= 1e-7:
                    print('Ok: %s' % grad_diff)
                elif grad_diff <= 1e-4:
                    print('Warning: %s' % grad_diff)
                else:
                    print('PROBLEM: %s' % grad_diff)

                ## end gradient checking

                # update weights; [alpha * delta_w_prev]
                # for momentum learning
                delta_w1 = self.eta * grad1
                delta_w2 = self.eta * grad2
                self.w1 -= (delta_w1 +\
                           (self.alpha * delta_w1_prev))
                self.w2 -= (delta_w2 +\
                           (self.alpha * delta_w2_prev))
                delta_w1_prev = delta_w1
                delta_w2_prev = delta_w2

        return self


Assuming that we named our modified multi-layer perceptron class MLPGradientCheck, we can now
initialize a new MLP with 10 hidden units. Also, we disable regularization, adaptive learning, and
momentum learning. In addition, we use regular gradient descent by setting minibatches to 1. The
code is as follows:

>>> nn_check = MLPGradientCheck(n_output=10,
...                             n_features=X_train.shape[1],
...                             n_hidden=10,
...                             l2=0.0,
...                             l1=0.0,
...                             epochs=10,
...                             eta=0.001,
...                             alpha=0.0,
...                             decrease_const=0.0,
...                             minibatches=1,
...                             random_state=1)


One downside of gradient checking is that it is computationally very expensive. Training a
neural network with gradient checking enabled is so slow that we really only want to use it for
debugging purposes. For this reason, it is not uncommon to run gradient checking only on a handful of
training samples (here, we choose 5). The code is as follows:


>>> nn_check.fit(X_train[:5], y_train[:5], print_progress=False)
Ok: 2.56712936241e-10
Ok: 2.94603251069e-10
Ok: 2.37615620231e-10
Ok: 2.43469423226e-10
Ok: 3.37872073158e-10
Ok: 3.63466384861e-10
Ok: 2.22472120785e-10
Ok: 2.33163708438e-10
Ok: 3.44653686551e-10
Ok: 2.17161707211e-10

As we can see from the code output, our multi-layer perceptron passes this test with excellent results.

Convergence in neural networks


You might be wondering why we did not use regular gradient descent but mini-batch learning to train
our neural network for the handwritten digit classification. You may recall our discussion on
stochastic gradient descent that we used to implement online learning. In online learning, we compute
the gradient based on a single training example ($k = 1$) at a time to perform the weight update.
Although this is a stochastic approach, it often leads to very accurate solutions with a much faster
convergence than regular gradient descent. Mini-batch learning is a special form of stochastic
gradient descent where we compute the gradient based on a subset $k$ of the $n$ training samples with
$1 < k < n$. Mini-batch learning has the advantage over online learning that we can make use of our
vectorized implementations to improve computational efficiency. At the same time, we can update the
weights much more often than in regular gradient descent. Intuitively, you can think of mini-batch
learning as predicting the voter turnout of a presidential election from a poll by asking only a
representative subset of the population rather than asking the entire population.
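
The trade-off can be illustrated with a small sketch. To keep it short, it fits a noise-free linear
model instead of a neural network; the data, the function fit_linear, and the chosen learning rate are
assumptions made for this example only. With the same number of epochs, the mini-batch version
performs ten times as many (vectorized) weight updates.

```python
import numpy as np

rng = np.random.RandomState(1)
X = rng.randn(100, 4)
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X.dot(true_w)   # noise-free targets for a simple linear model

def fit_linear(X, y, minibatches, epochs=50, eta=0.1):
    w = np.zeros(X.shape[1])
    n_updates = 0
    for _ in range(epochs):
        # split the sample indices into (nearly) equal-sized mini-batches
        for idx in np.array_split(np.arange(X.shape[0]), minibatches):
            error = X[idx].dot(w) - y[idx]
            grad = X[idx].T.dot(error) / len(idx)   # vectorized over the batch
            w -= eta * grad
            n_updates += 1
    return w, n_updates

w_gd, updates_gd = fit_linear(X, y, minibatches=1)    # regular gradient descent
w_mb, updates_mb = fit_linear(X, y, minibatches=10)   # mini-batches of 10 samples

err_gd = np.linalg.norm(w_gd - true_w)
err_mb = np.linalg.norm(w_mb - true_w)
print(updates_gd, err_gd)
print(updates_mb, err_mb)
```

On this toy problem, the mini-batch run ends up much closer to the true weights after the same number
of passes over the data.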


In addition, we added more tuning parameters such as the decrease constant and a parameter for an
adaptive learning rate. The reason is that neural networks are much harder to train than simpler
algorithms such as Adaline, logistic regression, or support vector machines. In multi-layer neural
networks, we typically have hundreds, thousands, or even billions of weights that we need to
optimize. Unfortunately, the cost function has a rough surface and the optimization algorithm can
easily become trapped in local minima, as shown in the following figure:


[Figure: cost surface with a local cost minimum and the global cost minimum]





Note that this representation is extremely simplified since our neural network has many dimensions,
which makes it impossible to visualize the actual cost surface for the human eye. Here, we only show
the cost surface for a single weight on the x axis. However, the main message is that we do not want
our algorithm to get trapped in local minima. By increasing the learning rate, we can more readily
escape such local minima. On the other hand, we also increase the chance of overshooting the global
optimum if the learning rate is too large. Since we initialize the weights randomly, we start with a
solution to the optimization problem that is typically hopelessly wrong. A decrease constant, which
we defined earlier, can help us to climb down the cost surface faster in the beginning, and the
adaptive learning rate allows us to better anneal to the global minimum.
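
As a reminder of how a decrease constant shapes the learning rate over time, here is a minimal sketch.
It assumes a schedule of the form eta / (1 + t * d), where t is the epoch and d the decrease constant;
the helper learning_rate and the concrete values are illustrative and may differ in detail from the
update used inside our MLP implementation.

```python
def learning_rate(eta0, decrease_const, epoch):
    # assumed schedule: large steps early on, smaller steps later
    return eta0 / (1.0 + decrease_const * epoch)

schedule = [learning_rate(eta0=0.001, decrease_const=0.1, epoch=t)
            for t in range(5)]
print(schedule)
```

Setting the decrease constant to 0.0 recovers a constant learning rate.
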

Other neural network architectures


In this chapter, we discussed one of the most popular feedforward neural network representations, the
multi-layer perceptron. Neural networks are currently one of the most active research topics in the
machine learning field, and there are many other neural network architectures that are well beyond the
scope of this book. If you are interested in learning more about neural networks and algorithms for
deep learning, I recommend reading the following introduction and overview: Y. Bengio. Learning Deep
Architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127, 2009. Yoshua
Bengio's book is currently freely available at

http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf.


Although neural networks really are a topic for another book, let's take at least a brief look at two
other popular architectures, convolutional neural networks and recurrent neural networks.

Convolutional Neural Networks


Convolutional Neural Networks (CNNs or ConvNets) gained popularity in computer vision due to
their extraordinarily good performance on image classification tasks. As of today, CNNs are one of the
most popular neural network architectures in deep learning. The key idea behind convolutional neural
networks is to build many layers of feature detectors to take the spatial arrangement of pixels in an
input image into account. Note that there exist many different variants of CNNs. In this section, we
will discuss only the general idea behind this architecture. If you are interested in learning more about
CNNs, I recommend taking a look at the publications of Yann LeCun (http://yann.lecun.com),
who is one of the co-inventors of CNNs. In particular, I can recommend the following literature for
getting started with CNNs:


• Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based Learning Applied to Document
  Recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
• P. Y. Simard, D. Steinkraus, and J. C. Platt. Best Practices for Convolutional Neural Networks
  Applied to Visual Document Analysis. IEEE, 2003, p.958.


As you will recall from our multi-layer perceptron implementation, we unrolled the images into 
feature vectors and these inputs were fully connected to the hidden layer—spatial information was not 
encoded in this network architecture. In CNNs, we use receptive fields to connect the input layer to a 
feature map. These receptive fields can be understood as overlapping windows that we slide over the 
pixels of an input image to create a feature map. The stride lengths of the window sliding as well as 
the window size are additional hyperparameters of the model that we need to define a priori. The 
process of creating the feature map is also called convolution. An example of such a convolutional 
layer, the layer that connects the input pixels to each unit in the feature map, is shown in the following 
figure: 


[Figure: input image and feature maps]
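
The sliding-window computation can be sketched in a few lines of NumPy. This is a simplified
illustration for a single-channel image and a single feature detector; the function convolve2d_valid,
the toy image, and the kernel values are assumptions made for this example. As is common in CNNs, the
kernel is applied without flipping (strictly speaking, a cross-correlation).

```python
import numpy as np

def convolve2d_valid(image, kernel, stride=1):
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = image[r:r + kh, c:c + kw]            # receptive field
            feature_map[i, j] = np.sum(patch * kernel)   # same kernel everywhere
    return feature_map

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])                   # toy 2x2 feature detector

fmap = convolve2d_valid(image, kernel, stride=1)
fmap_stride2 = convolve2d_valid(image, kernel, stride=2)
print(fmap.shape, fmap_stride2.shape)
```

Window size and stride determine the size of the feature map: a 2x2 window on a 5x5 image gives a 4x4
map with stride 1 and a 2x2 map with stride 2.
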








It is important to note that the feature detectors are replicates, which means that the receptive fields
that map the features to the units in the next layer share the same weights. Here, the key idea is that
if a feature detector is useful in one part of the image, it might be useful in another part as well.
The nice side effect of this approach is that it greatly reduces the number of parameters that need to
be learned. Since we allow different patches of the image to be represented in different ways, CNNs are
particularly good at recognizing objects of different sizes and different positions in an image. We do
not need to worry so much about rescaling and centering the images as it has been done in MNIST.


In CNNs, a convolutional layer is followed by a pooling layer (sometimes also called subsampling).
In pooling, we summarize neighboring feature detectors to reduce the number of features
for the next layer. Pooling can be understood as a simple method of feature extraction where we take
the average or maximum value of a patch of neighboring features and pass it on to the next layer. To
create a deep convolutional neural network, we stack multiple layers, alternating between
convolutional and pooling layers, before we connect them to a multi-layer perceptron for
classification. This is shown in the following figure:


[Figure: input image -> convolutional layer (feature maps) -> pooling layer (feature maps) ->
fully connected MLP classifier -> output label]
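
Pooling as described above can be sketched as follows. The example assumes non-overlapping square
patches, which is one common choice among several; the function pool2d and the toy feature map are
hypothetical and serve only as an illustration.

```python
import numpy as np

def pool2d(feature_map, size=2, mode='max'):
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size   # drop incomplete border patches
    # view the map as a grid of (size x size) patches
    patches = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    if mode == 'max':
        return patches.max(axis=(1, 3))
    return patches.mean(axis=(1, 3))

fmap = np.array([[1., 2., 5., 6.],
                 [3., 4., 7., 8.],
                 [9., 8., 1., 0.],
                 [7., 6., 3., 2.]])

max_pooled = pool2d(fmap, size=2, mode='max')
avg_pooled = pool2d(fmap, size=2, mode='mean')
print(max_pooled)
print(avg_pooled)
```

Each 2x2 patch is summarized by a single number, so the 4x4 feature map shrinks to 2x2 before it is
passed on to the next layer.
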

Recurrent Neural Networks


Recurrent Neural Networks (RNNs) can be thought of as feedforward neural networks with
feedback loops, which are typically trained with backpropagation through time. In RNNs, the neurons
only fire for a limited amount of time before they are (temporarily) deactivated. In turn, these
neurons activate other neurons that fire at a later point in time. Basically, we can think of recurrent
neural networks as MLPs with an additional time variable. The time component and dynamic structure
allow the network to use not only the current inputs but also the inputs that it encountered earlier.


[Figure: input layer -> hidden layer (with recurrence) -> output layer]
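
The idea that earlier inputs influence later outputs can be sketched with a very simple recurrence. The
following forward pass is a minimal illustration with randomly chosen weights; the function
rnn_forward and the weight names W_xh, W_hh, and W_hy are assumptions for this example and do not
correspond to a particular library or to the LSTM units mentioned below.

```python
import numpy as np

def rnn_forward(X_seq, W_xh, W_hh, W_hy):
    h = np.zeros(W_hh.shape[0])   # hidden state carried through time
    outputs = []
    for x_t in X_seq:
        # the new hidden state depends on the current input and the previous state
        h = np.tanh(W_xh.dot(x_t) + W_hh.dot(h))
        outputs.append(W_hy.dot(h))
    return np.array(outputs), h

rng = np.random.RandomState(1)
W_xh = rng.randn(4, 3) * 0.5   # input -> hidden
W_hh = rng.randn(4, 4) * 0.5   # hidden -> hidden (the recurrence)
W_hy = rng.randn(2, 4) * 0.5   # hidden -> output
X_seq = rng.randn(5, 3)        # sequence of 5 time steps with 3 features each

outputs, h_last = rnn_forward(X_seq, W_xh, W_hh, W_hy)

# perturb only the first time step and recompute the outputs
X_changed = X_seq.copy()
X_changed[0] += 1.0
outputs_changed, _ = rnn_forward(X_changed, W_xh, W_hh, W_hy)
print(outputs[-1], outputs_changed[-1])
```

Although only the first input was changed, the last output changes as well, because the hidden state
carries information forward through time.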


Although RNNs achieved remarkable results in speech recognition, language translation, and
connected handwriting recognition, these network architectures are typically much harder to train.
This is because we cannot simply backpropagate the error layer by layer; we have to consider the
additional time component, which amplifies the vanishing and exploding gradient problem. In 1997,
Juergen Schmidhuber and his co-workers introduced the so-called long short-term memory (LSTM)
units to overcome this problem: S. Hochreiter and J. Schmidhuber. Long Short-term Memory. Neural
Computation, 9(8):1735-1780, 1997.


However, we should note that there are many different variants of RNNs, and a detailed discussion is
beyond the scope of this book.

A few last words about neural network
implementation 


You might be wondering why we went through all of this theory just to implement a simple multi-layer
artificial network that can classify handwritten digits instead of using an open source Python machine
learning library. One reason is that at the time of writing this book, scikit-learn does not have an MLP
implementation. More importantly, we (machine learning practitioners) should have at least a basic
understanding of the algorithms that we are using in order to apply machine learning techniques
appropriately and successfully.


Now that we know how feedforward neural networks work, we are ready to explore more
sophisticated Python libraries built on top of NumPy such as Theano
(http://deeplearning.net/software/theano/), which allows us to construct neural networks more
efficiently. We will see this in Chapter 13, Parallelizing Neural Network Training with Theano.
Over the last couple of years, Theano has gained a lot of popularity among machine learning
researchers, who use it to construct deep neural networks because of its ability to optimize
mathematical expressions for computations on multi-dimensional arrays utilizing Graphics
Processing Units (GPUs).


A great collection of Theano tutorials can be found at

http://deeplearning.net/software/theano/tutorial/index.html#tutorial.


There are also a number of interesting libraries that are being actively developed to train neural 
networks in Theano, which you should keep on your radar: 


• Pylearn2 (http://deeplearning.net/software/pylearn2/)
• Lasagne (https://lasagne.readthedocs.org/en/latest/)
• Keras (http://keras.io)

Summary


In this chapter, you have learned about the most important concepts behind multi-layer artificial
neural networks, which are currently the hottest topic in machine learning research. In Chapter 2,
Training Machine Learning Algorithms for Classification, we started our journey with simple
single-layer neural network structures and now we have connected multiple neurons to a powerful
neural network architecture to solve complex problems such as handwritten digit recognition. We
demystified the popular backpropagation algorithm, which is one of the building blocks of many
neural network models that are used in deep learning. After learning about the backpropagation
algorithm, we were able to update the weights of such a complex neural network. We also added
useful modifications such as mini-batch learning and an adaptive learning rate that allows us to train a
neural network more efficiently.

Chapter 13. Parallelizing Neural Network
Training with Theano 


In the previous chapter, we went over a lot of mathematical concepts to understand how feedforward
artificial neural networks and multilayer perceptrons in particular work. First and foremost, having a
good understanding of the mathematical underpinnings of machine learning algorithms is very
important, since it helps us to use those powerful algorithms most effectively and correctly.
Throughout the previous chapters, you dedicated a lot of time to learning the best practices of
machine learning, and you even practiced implementing algorithms yourself from scratch. In this
chapter, you can lean back a little bit and rest on your laurels. I want you to enjoy this exciting journey
through one of the most powerful libraries that is used by machine learning researchers to experiment
with deep neural networks and train them very efficiently. Most of modern machine learning research
utilizes computers with powerful Graphics Processing Units (GPUs). If you are interested in diving
into deep learning, which is currently the hottest topic in machine learning research, this chapter is
definitely for you. However, do not worry if you do not have access to GPUs; in this chapter, the use
of GPUs will be optional, not required.


Before we get started, let me give you a brief overview of the topics that we will cover in this 
chapter: 


• Writing optimized machine learning code with Theano
• Choosing activation functions for artificial neural networks
• Using the Keras deep learning library for fast and easy experimentation

Building, compiling, and running expressions
with Theano 


In this section, we will explore the powerful Theano tool, which has been designed to train machine
learning models most effectively using Python. The Theano development started back in 2008 in the
LISA lab (short for Laboratoire d'Informatique des Systèmes Adaptatifs
(http://lisa.iro.umontreal.ca)) led by Yoshua Bengio.


Before we discuss what Theano really is and what it can do for us to speed up our machine learning
tasks, let's discuss some of the challenges when we are running expensive calculations on our
hardware. Luckily, the performance of computer processors has kept improving over the
years, which allows us to train more powerful and complex learning systems to improve the
predictive performance of our machine learning models. Even the cheapest desktop computer
hardware that is available nowadays comes with processing units that have multiple cores. In the
previous chapters, we saw that many functions in scikit-learn allow us to spread the computations
over multiple processing units. However, by default, Python is limited to execution on one core, due
to the Global Interpreter Lock (GIL). And although we can take advantage of its
multiprocessing library to distribute computations over multiple cores, we have to consider that
even advanced desktop hardware rarely comes with more than 8 or 16 such cores.


If we think back to the previous chapter, where we implemented a very simple multilayer perceptron
with only one hidden layer consisting of 50 units, we already had to optimize approximately 40,000
weights ([784 + 1] × 50 + [50 + 1] × 10 = 39,760) to learn a model for a very simple image
classification task. The images in MNIST are rather small (28 x 28 pixels), and we can only imagine
the explosion in the number of parameters if we want to add additional hidden layers or work with
images that have higher pixel densities. Such a task would quickly become unfeasible for a single
processing unit. Now, the question is how can we tackle such problems more effectively? The obvious
solution to this problem is to use GPUs. GPUs are real workhorses. You can think of a graphics card
as a small computer cluster inside your machine. Another advantage is that modern GPUs are
relatively cheap compared to the state-of-the-art CPUs, as we can see in the following overview:





Specifications                 Intel Core i7-5960X Processor    NVIDIA GeForce GTX 980 Ti
                               Extreme Edition
Base Clock Frequency           3.0 GHz                          1.0 GHz
Cores                          8                                2816
Memory Bandwidth               68 GB/s                          336.5 GB/s
Floating-Point Calculations    354 GFLOPS                       5632 GFLOPS
Cost                           $1000.00                         $700.00

Sources for this can be found on the following websites:

• http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-980-ti/specifications
• ...Cache-up-to-3_50-GHz

(date: August 20, 2015)


At 70 percent of the price of a modern CPU, we can get a GPU that has roughly 350 times more cores
(2816 versus 8) and is capable of around 16 times more floating-point calculations per second. So,
what is holding us back from utilizing GPUs for our machine learning tasks? The challenge is that
writing code to target GPUs is not as trivial as executing Python code in our interpreter. There are
special packages such as CUDA and OpenCL that allow us to target the GPU. However, writing code in
CUDA or OpenCL is probably not the most convenient environment for implementing and running machine
learning algorithms. The good news is that this is what Theano was developed for!

What is Theano?


What exactly is Theano: a programming language, a compiler, or a Python library? It turns out that it
fits all these descriptions. Theano has been developed to implement, compile, and evaluate
mathematical expressions very efficiently with a strong focus on multidimensional arrays (tensors). It
comes with an option to run code on CPU(s). However, its real power comes from utilizing GPUs to
take advantage of the large memory bandwidths and great capabilities for floating point math. Using
Theano, we can easily run code in parallel over shared memory as well. In 2010, the developers of
Theano reported a 1.8x faster performance than NumPy when the code was run on the CPU, and if
Theano targeted the GPU, it was even 11x faster than NumPy (J. Bergstra, O. Breuleux, F. Bastien, P.
Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: A CPU and
GPU Math Compiler in Python. In Proc. 9th Python in Science Conf, pages 1-7, 2010.). Now, keep
in mind that this benchmark is from 2010, and Theano has improved significantly over the years, and
so have the capabilities of modern graphics cards.


So, how does Theano relate to NumPy? Theano is built on top of NumPy and it has a very similar
syntax, which makes the usage very convenient for people who are already familiar with the latter. To
be fair, Theano is not just "NumPy on steroids" as many people would describe it, but it also shares
some similarities with SymPy (http://www.sympy.org), a Python package for symbolic computations
(or symbolic algebra). As we saw in previous chapters, in NumPy, we describe what our variables
are, and how we want to combine them; then, the code is executed line by line. In Theano, however,
we write down the problem first and the description of how we want to analyze it. Then, Theano
optimizes and compiles code for us using C/C++, or CUDA/OpenCL if we want to run it on the GPU.
In order to generate the optimized code for us, Theano needs to know the scope of our problem; think
of it as a tree of operations (or a graph of symbolic expressions). Note that Theano is still under
active development, and many new features are added and improvements are made on a regular basis.
In this chapter, we will explore the basic concepts behind Theano and learn how to use it for machine
learning tasks. Since Theano is a large library with many advanced features, it would be impossible
to cover all of them in this book. However, I will provide useful links to the excellent online
documentation (http://deeplearning.net/software/theano/) if you want to learn more about this library.
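
To make the idea of a graph of symbolic expressions more tangible, here is a toy sketch in plain
Python. It is emphatically not how Theano is implemented; the classes Var and Op and the helper
function are hypothetical names invented for this illustration. The point is only that the expression
is first recorded as a tree of operations and evaluated later, once concrete values are supplied.

```python
class Var:
    """A symbolic variable: it has a name but no value yet."""
    def __init__(self, name):
        self.name = name

    def __add__(self, other):
        return Op('add', self, other)

    def __mul__(self, other):
        return Op('mul', self, other)

    def evaluate(self, env):
        return env[self.name]


class Op(Var):
    """A node in the expression tree that combines two sub-expressions."""
    def __init__(self, kind, left, right):
        self.kind, self.left, self.right = kind, left, right

    def evaluate(self, env):
        a, b = self.left.evaluate(env), self.right.evaluate(env)
        return a + b if self.kind == 'add' else a * b


def function(inputs, output):
    # "compile" step of the toy example: bind input names to positions
    names = [v.name for v in inputs]

    def compiled(*values):
        return output.evaluate(dict(zip(names, values)))
    return compiled


x, w, b = Var('x'), Var('w'), Var('b')
z = w * x + b                    # builds the tree; nothing is computed yet
net_input = function([w, x, b], z)
result = net_input(2.0, 1.0, 0.5)
print(result)
```

A real symbolic library can inspect and optimize such a tree before generating code, which is not
possible when every line is executed immediately.
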

First steps with Theano


In this section, we will take our first steps with Theano. Depending on how your system is set up, you 
typically can just use the pip installer and install Theano from PyPI by executing the following from 
your command-line terminal: 


pip install Theano 


If you experience problems with the installation procedure, I recommend reading more
about system and platform-specific recommendations that are provided at
http://deeplearning.net/software/theano/install.html. Note that all the code in this chapter can be run
on your CPU; using the GPU is entirely optional but recommended if you fully want to enjoy the
benefits of Theano. If you have a graphics card that supports either CUDA or OpenCL, please refer to
the up-to-date tutorial at http://deeplearning.net/software/theano/tutorial/using_gpu.html#using-gpu to
set it up appropriately.


At its core, Theano is built around so-called tensors to evaluate symbolic mathematical expressions.
Tensors can be understood as a generalization of scalars, vectors, matrices, and so on. More
concretely, a scalar can be defined as a rank-0 tensor, a vector as a rank-1 tensor, a matrix as a rank-2
tensor, and matrices stacked in a third dimension as rank-3 tensors. As a warm-up exercise, we will
start with the use of simple scalars from the Theano tensor module to compute a net input $z$ of a
sample point $x$ in a one-dimensional dataset with weight $w_1$ and bias $w_0$:

$$z = x_1 \times w_1 + w_0$$

The code is as follows:


>>> import theano
>>> from theano import tensor as T

# initialize
>>> x1 = T.scalar()
>>> w1 = T.scalar()
>>> w0 = T.scalar()
>>> z1 = w1 * x1 + w0

# compile
>>> net_input = theano.function(inputs=[w1, x1, w0],
...                             outputs=z1)

# execute
>>> print('Net input: %.2f' % net_input(2.0, 1.0, 0.5))
Net input: 2.50


This was pretty straightforward, right? If we write code in Theano, we just have to follow three
simple steps: define the symbols (variable objects), compile the code, and execute it. In the
initialization step, we defined three symbols, x1, w1, and w0, to compute z1. Then, we compiled a
function net_input to compute the net input z1.


However, there is one particular detail that deserves special attention if we write Theano code: the
type of our variables (dtype). Consider it as a blessing or burden, but in Theano we need to choose
whether we want to use 64 or 32 bit integers or floats, which greatly affects the performance of the
code. Let's discuss those variable types in more detail in the next section.

Configuring Theano


Nowadays, no matter whether we run Mac OS X, Linux, or Microsoft Windows, we mainly use
software and applications using 64-bit memory addresses. However, if we want to accelerate the
evaluation of mathematical expressions on GPUs, we still often rely on the older 32-bit memory
addresses. Currently, this is the only supported computing architecture in Theano. In this section, we
will see how to configure Theano appropriately. If you are interested in more details about the
Theano configuration, please refer to the online documentation at

http://deeplearning.net/software/theano/library/config.html.


When we are implementing machine learning algorithms, we are mostly working with floating point
numbers. By default, both NumPy and Theano use the double-precision floating-point format
(float64). However, it would be really useful to toggle back and forth between float64 (CPU) and
float32 (GPU) when we are developing Theano code for prototyping on CPU and execution on
GPU. For example, to access the default settings for Theano's float variables, we can execute the
following code in our Python interpreter:


o> Print (Cheand.conrti1g.TfloStx) 
floato4 


If you have not modified any settings after the installation of Theano, the floating point default should
be float64. However, we can simply change it to float32 in our current Python session via the
following code:


>>> theano.config.floatX = 'float32'
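To make the trade-off between the two float types more tangible, here is a minimal sketch that uses plain NumPy rather than Theano. The array size of one million elements is an arbitrary assumption chosen for illustration; the point is only that float32 halves the memory footprint at the price of lower precision.

```python
import numpy as np

# One million values stored in double and in single precision
a64 = np.ones(1000000, dtype=np.float64)
a32 = a64.astype(np.float32)

# float32 needs half the memory of float64
print(a64.nbytes, a32.nbytes)

# Machine epsilon: the precision we give up in exchange
print(np.finfo(np.float64).eps, np.finfo(np.float32).eps)
```

Halving the size of each array also halves the amount of data that has to be moved to the GPU, which is one reason why the single-precision format is attractive there.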


Note that although the current GPU utilization in Theano requires float32 types, we can use both
float64 and float32 on our CPUs. Thus, if you want to change the default settings globally, you can
change the settings in your THEANO_FLAGS variable via the command-line (Bash) terminal:


export THEANO_FLAGS=floatX=float32


Alternatively, you can apply these settings only to a particular Python script, by running it as follows: 


THEANO_FLAGS=floatX=float32 python your_script.py


So far, we discussed how to set the default floating-point types to get the best bang for the buck on our 
GPU using Theano. Next, let's discuss the options to toggle between CPU and GPU execution. If we 
execute the following code, we can check whether we are using CPU or GPU: 


>>> print (theano.config.device) 
cpu


My personal recommendation is to use cpu as default, which makes prototyping and code debugging
easier. For example, you can run Theano code on your CPU by executing it as a script from your
command-line terminal:


THEANO_FLAGS=device=cpu,floatX=float64 python your_script.py


However, once we have implemented the code and want to run it most efficiently utilizing our GPU
hardware, we can then run it via the following code without making additional modifications to our
original code:


THEANO_FLAGS=device=gpu,floatX=float32 python your_script.py


It may also be convenient to create a .theanorc file in your home directory to make these 
configurations permanent. For example, to always use float32 and the GPU, you can create such a 
.theanorc file including these settings. The command is as follows:


echo -e "\n[global]\nfloatX=float32\ndevice=gpu\n" >> ~/.theanorc 


If you are not operating on a Mac OS X or Linux terminal, you can create a .theanorc file manually
using your favorite text editor and add the following contents:


[global]
floatX=float32
device=gpu
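If you prefer to generate the file contents from Python, the following sketch builds the same text with the standard-library configparser module. This is only one possible approach and not something Theano requires; the section and key names are taken from the example above, and writing the result to ~/.theanorc is left to you.

```python
import configparser
import io

config = configparser.ConfigParser()
config.optionxform = str  # keep the capital X in floatX instead of lowercasing keys
config['global'] = {'floatX': 'float32', 'device': 'gpu'}

# Render the configuration into a string instead of touching the file system
buffer = io.StringIO()
config.write(buffer, space_around_delimiters=False)
theanorc_text = buffer.getvalue()
print(theanorc_text)
```

Rendering into a string first makes it easy to inspect the result before overwriting an existing configuration file.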


Now that we know how to configure Theano appropriately with respect to our available hardware, 
we can discuss how to use more complex array structures in the next section. 
Working with array structures


In this section, we will discuss how to use array structures in Theano using its tensor module. By 
executing the following code, we will create a simple 2 x 3 matrix, and calculate the column sums 
using Theano's optimized tensor expressions: 


>>> import numpy as np

# initialize
>>> x = T.fmatrix(name='x')
>>> x_sum = T.sum(x, axis=0)

# compile
>>> calc_sum = theano.function(inputs=[x], outputs=x_sum)

# execute (Python list)
>>> ary = [[1, 2, 3], [1, 2, 3]]
>>> print('Column sum:', calc_sum(ary))
Column sum: [ 2.  4.  6.]

# execute (NumPy array)
>>> ary = np.array([[1, 2, 3], [1, 2, 3]],
...                dtype=theano.config.floatX)
>>> print('Column sum:', calc_sum(ary))
Column sum: [ 2.  4.  6.]


As we saw earlier, there are just three basic steps that we have to follow when we are using Theano:
defining the variable, compiling the code, and executing it. The preceding example shows that Theano
can work with both Python and NumPy types: list and numpy.ndarray.
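For comparison, the same column sums can be computed eagerly with NumPy alone, which may help to see what axis=0 means in the Theano expression. This is a sketch for orientation only; the variable names are made up for this illustration.

```python
import numpy as np

ary_list = [[1, 2, 3], [1, 2, 3]]
ary_np = np.array(ary_list, dtype=np.float32)

# axis=0 collapses the rows, leaving one sum per column
col_sum_from_list = np.sum(ary_list, axis=0)
col_sum_from_array = np.sum(ary_np, axis=0)
print(col_sum_from_list, col_sum_from_array)
```

The difference is that NumPy evaluates the expression immediately, whereas Theano first builds a symbolic graph that is compiled before it is executed.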


Note 


Note that we used the optional name argument (here, x) when we created the fmatrix
TensorVariable, which can be helpful to debug our code or print the Theano graph. For example, if
we printed the fmatrix symbol x without giving it a name, the print function would return its
TensorType:


>>> print(x)
<TensorType(float32, matrix)>


However, if the TensorVariable was initialized with a name argument x as in our preceding
example, it would be returned by the print function:


>>> print(x)
x


The TensorType can be accessed via the type method:


>>> print(x.type())
<TensorType(float32, matrix)>
Theano also has a very smart memory management system that reuses memory to make it fast. More
concretely, Theano spreads memory space across multiple devices, CPUs and GPUs; to track changes
in the memory space, it aliases the respective buffers. Next, we will take a look at the shared
variable, which allows us to spread large objects (arrays) and grants multiple functions read and
write access, so that we can also perform updates on those objects after compilation. A detailed
description of the memory handling in Theano is beyond the scope of this book. Thus, I encourage you
to follow up on the up-to-date information about Theano and memory management at
http://deeplearning.net/software/theano/tutorial/aliasing.html.


# initialize
>>> x = T.fmatrix('x')
>>> w = theano.shared(np.asarray([[0.0, 0.0, 0.0]],
...                              dtype=theano.config.floatX))
>>> z = x.dot(w.T)
>>> update = [[w, w + 1.0]]

# compile
>>> net_input = theano.function(inputs=[x],
...                             updates=update,
...                             outputs=z)

# execute
>>> data = np.array([[1, 2, 3]],
...                 dtype=theano.config.floatX)
>>> for i in range(5):
...     print('z%d:' % i, net_input(data))
z0: [[ 0.]]
z1: [[ 6.]]
z2: [[ 12.]]
z3: [[ 18.]]
z4: [[ 24.]]


As you can see, sharing memory via Theano is really easy: In the preceding example, we defined an
update variable where we declared that we want to update an array w by a value 1.0 after each
iteration in the for loop. After we defined which object we want to update and how, we passed this
information to the updates parameter of the theano.function compiler.
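The order of operations (compute the output with the current weights, then apply the update) can be mimicked in plain NumPy. The following is only an emulation under that assumption; SharedArray and compile_net_input are hypothetical helper names invented for this sketch and are not part of Theano.

```python
import numpy as np

class SharedArray:
    """Hypothetical stand-in for a Theano shared variable: a mutable value holder."""
    def __init__(self, value):
        self.value = value

def compile_net_input(w):
    """Return a function that outputs z = x . w^T and then applies w <- w + 1."""
    def net_input(x):
        z = x.dot(w.value.T)       # output is computed with the current weights
        w.value = w.value + 1.0    # the update takes effect for the next call
        return z
    return net_input

w = SharedArray(np.zeros((1, 3), dtype=np.float32))
net_input = compile_net_input(w)
data = np.array([[1, 2, 3]], dtype=np.float32)

outputs = [float(net_input(data)[0, 0]) for _ in range(5)]
print(outputs)
```

Because the first call still sees the all-zero weights, the first output is 0, and each later call adds 1 + 2 + 3 = 6 to the previous result.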


Another neat trick in Theano is to use the givens variable to insert values into the graph before
compiling it. Using this approach, we can reduce the number of transfers from RAM over CPUs to
GPUs to speed up learning algorithms that use shared variables. If we use the inputs parameter in
theano.function, data is transferred from the CPU to the GPU multiple times, for example, if we
iterate over a dataset multiple times (epochs) during gradient descent. Using givens, we can keep the
dataset on the GPU if it fits into its memory (for example, if we are learning with mini-batches). The
code is as follows:

# initialize
>>> data = np.array([[1, 2, 3]],
...                 dtype=theano.config.floatX)
>>> x = T.fmatrix('x')
>>> w = theano.shared(np.asarray([[0.0, 0.0, 0.0]],
...                              dtype=theano.config.floatX))
>>> z = x.dot(w.T)
>>> update = [[w, w + 1.0]]

# compile
>>> net_input = theano.function(inputs=[],
...                             updates=update,
...                             givens={x: data},
...                             outputs=z)

# execute
>>> for i in range(5):
...     print('z%d:' % i, net_input())
z0: [[ 0.]]
z1: [[ 6.]]
z2: [[ 12.]]
z3: [[ 18.]]
z4: [[ 24.]]


Looking at the preceding code example, we also see that the givens attribute is a Python dictionary
that maps a variable (here, the symbol x that we created when we defined the fmatrix) to the actual
Python object.
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Wrapping things up — a linear regression example 


Now that we familiarized ourselves with Theano, let's take a look at a really practical example and 
implement Ordinary Least Squares (OLS) regression. For a quick refresher on regression analysis, 
please refer to Chapter 10, Predicting Continuous Target Variables with Regression Analysis. 


Let's start by creating a small one-dimensional toy dataset with ten training samples:


>>> X_train = np.asarray([[0.0], [1.0],
...                       [2.0], [3.0],
...                       [4.0], [5.0],
...                       [6.0], [7.0],
...                       [8.0], [9.0]],
...                      dtype=theano.config.floatX)
>>> y_train = np.asarray([1.0, 1.3,
...                       3.1, 2.0,
...                       5.0, 6.3,
...                       6.6, 7.4,
...                       8.0, 9.0],
...                      dtype=theano.config.floatX)


Note that we are using theano.config.floatX when we construct the NumPy arrays, so we can
optionally toggle back and forth between CPU and GPU if we want.


Next, let's implement a training function to learn the weights of the linear regression model, using the
sum of squared errors cost function. Note that $w_0$ is the bias unit (the y-axis intercept at $x = 0$). The
code is as follows:


import theano
from theano import tensor as T
import numpy as np


def train_linreg(X_train, y_train, eta, epochs):

    costs = []
    # Initialize arrays
    eta0 = T.fscalar('eta0')
    y = T.fvector(name='y')
    X = T.fmatrix(name='X')
    w = theano.shared(np.zeros(
                        shape=(X_train.shape[1] + 1),
                        dtype=theano.config.floatX),
                      name='w')

    # Calculate cost
    net_input = T.dot(X, w[1:]) + w[0]
    errors = y - net_input
    cost = T.sum(T.pow(errors, 2))

    # Perform gradient update
    gradient = T.grad(cost, wrt=w)
    update = [(w, w - eta0 * gradient)]

    # Compile model
    train = theano.function(inputs=[eta0],
                            outputs=cost,
                            updates=update,
                            givens={X: X_train,
                                    y: y_train})

    for _ in range(epochs):
        costs.append(train(eta))

    return costs, w


A really nice feature in Theano is the grad function that we used in the preceding code example. The
grad function automatically computes the derivative of an expression with respect to its parameters
that we passed to the function as the wrt argument.
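To see what the automatic differentiation saves us, here is a sketch of the same training loop in plain NumPy, where the gradient of the sum of squared errors is derived by hand: the partial derivative with respect to $w_0$ is $-2 \sum_i e_i$ and with respect to the remaining weights it is $-2 X^T e$, where $e$ denotes the errors. The function name train_linreg_numpy is made up for this illustration, and the toy data below is assumed to match the ten-sample dataset used in this section.

```python
import numpy as np

def train_linreg_numpy(X, y, eta, epochs):
    """Gradient descent on the SSE cost with a manually derived gradient."""
    w = np.zeros(X.shape[1] + 1)
    costs = []
    for _ in range(epochs):
        errors = y - (X.dot(w[1:]) + w[0])
        costs.append(float(np.sum(errors ** 2)))   # cost before the update
        # d(SSE)/dw0 = -2 * sum(errors); d(SSE)/dw_j = -2 * X^T errors
        gradient = np.concatenate(([-2.0 * errors.sum()],
                                   -2.0 * X.T.dot(errors)))
        w = w - eta * gradient
    return costs, w

# Toy data: x = 0, ..., 9 with the assumed target values
X_toy = np.arange(10, dtype=np.float64).reshape(-1, 1)
y_toy = np.array([1.0, 1.3, 3.1, 2.0, 5.0, 6.3, 6.6, 7.4, 8.0, 9.0])

costs_np, w_np = train_linreg_numpy(X_toy, y_toy, eta=0.001, epochs=10)
print(costs_np)
```

With T.grad, we never have to write the gradient line ourselves, which becomes a real advantage once the cost function involves several layers of nonlinear activations.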


After we implemented the training function, let's train our linear regression model and take a look at
the values of the Sum of Squared Errors (SSE) cost function to check if it converged:

>>> import matplotlib.pyplot as plt
>>> costs, w = train_linreg(X_train, y_train, eta=0.001, epochs=10)
>>> plt.plot(range(1, len(costs)+1), costs)
>>> plt.tight_layout()
>>> plt.xlabel('Epoch')
>>> plt.ylabel('Cost')
>>> plt.show()


As we can see in the following plot, the learning algorithm already converged after the fifth epoch:

[Figure: cost (SSE) plotted against the epoch]
So far so good; by looking at the cost function, it seems that we built a working regression model
from this particular dataset. Now, let's compile a new function to make predictions based on the input
features:


def predict_linreg(X, w):
    Xt = T.matrix(name='X')
    net_input = T.dot(Xt, w[1:]) + w[0]
    predict = theano.function(inputs=[Xt],
                              givens={w: w},
                              outputs=net_input)
    return predict(X)


Implementing a predict function was pretty straightforward following the three-step procedure of
Theano: define, compile, and execute. Next, let's plot the linear regression fit on the training data:


>>> plt.scatter(X_train,
...             y_train,
...             marker='s',
...             s=50)
>>> plt.plot(range(X_train.shape[0]),
...          predict_linreg(X_train, w),
...          color='gray',
...          marker='o',
...          markersize=4,
...          linewidth=3)
>>> plt.xlabel('x')
>>> plt.ylabel('y')
>>> plt.show()


As we can see in the resulting plot, our model fits the data points appropriately:
Implementing a simple regression model was a good exercise to become familiar with the Theano
API. However, our ultimate goal is to play out the advantages of Theano, that is, implementing
powerful artificial neural networks. We should now be equipped with all the tools we would need to
implement the multilayer perceptron from Chapter 12, Training Artificial Neural Networks for
Image Recognition, in Theano. However, this would be rather boring, right? Thus, we will take a
look at one of my favorite deep learning libraries built on top of Theano to make the experimentation
with neural networks as convenient as possible. However, before we introduce the Keras library,
let's first discuss the different choices of activation functions in neural networks in the next section.
Choosing activation functions for feedforward
neural networks


For simplicity, we have only discussed the sigmoid activation function in the context of multilayer
feedforward neural networks so far; we used it in the hidden layer as well as the output layer in the
multilayer perceptron implementation in Chapter 12, Training Artificial Neural Networks for Image
Recognition. Although we referred to this activation function as the sigmoid function (as it is commonly
called in the literature), the more precise definition would be logistic function or negative log-
likelihood function. In the following subsections, you will learn more about alternative sigmoidal
functions that are useful for implementing multilayer neural networks.


Technically, we could use any function as an activation function in multilayer neural networks as long as
it is differentiable. We could even use linear activation functions such as in Adaline (Chapter 2,
Training Machine Learning Algorithms for Classification). However, in practice, it would not be
very useful to use linear activation functions for both hidden and output layers, since we want to
introduce nonlinearity in a typical artificial neural network to be able to tackle complex problem
tasks. Sums and compositions of linear functions yield a linear function, after all.
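A quick numerical check illustrates why purely linear layers add no expressive power: two stacked linear layers can always be replaced by a single one. The weight matrices below are random placeholders (an assumption made only for this sketch), and bias terms are omitted for brevity.

```python
import numpy as np

rng = np.random.RandomState(0)
W1 = rng.randn(4, 3)   # hypothetical hidden-layer weights
W2 = rng.randn(2, 4)   # hypothetical output-layer weights
x = rng.randn(3)       # an arbitrary input vector

two_layer_output = W2.dot(W1.dot(x))          # two stacked linear layers
single_layer_output = (W2.dot(W1)).dot(x)     # one equivalent linear layer
print(np.allclose(two_layer_output, single_layer_output))
```

Only a nonlinear activation between the two matrix products prevents this collapse into a single linear map.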


The logistic activation function that we used in the previous chapter probably mimics the concept of a
neuron in a brain most closely: we can think of it as the probability of whether a neuron fires or not.
However, logistic activation functions can be problematic if we have highly negative inputs, since the
output of the sigmoid function would be close to zero in this case. If the sigmoid function returns
outputs that are close to zero, the neural network would learn very slowly and it becomes more likely
that it gets trapped in local minima during training. This is why people often prefer a hyperbolic
tangent as the activation function in hidden layers. Before we discuss what a hyperbolic tangent looks
like, let's briefly recapitulate some of the basics of the logistic function and look at a generalization
that makes it more useful for multi-class classification tasks.
Logistic function recap


As we mentioned in the introduction to this section, the logistic function, often just called the
sigmoid function, is in fact a special case of a sigmoid function. We recall from the section on
logistic regression in Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn, that we
can use the logistic function to model the probability that sample $x$ belongs to the positive class
(class 1) in a binary classification task:

$$\phi_{logistic}(z) = \frac{1}{1 + e^{-z}}$$

Here, the scalar variable $z$ is defined as the net input:

$$z = w_0 x_0 + \dots + w_m x_m = \sum_{j=0}^{m} x_j w_j = \mathbf{w}^T \mathbf{x}$$

Note that $w_0$ is the bias unit (y-axis intercept, $x_0 = 1$). To provide a more concrete example, let's
assume a model for a two-dimensional data point $x$ and a model with the following weight
coefficients assigned to the vector $w$:


>>> X = np.array([[1, 1.4, 1.5]])
>>> w = np.array([0.0, 0.2, 0.4])

>>> def net_input(X, w):
...     z = X.dot(w)
...     return z

>>> def logistic(z):
...     return 1.0 / (1.0 + np.exp(-z))

>>> def logistic_activation(X, w):
...     z = net_input(X, w)
...     return logistic(z)

>>> print('P(y=1|x) = %.3f'
...       % logistic_activation(X, w)[0])
P(y=1|x) = 0.707


If we calculate the net input and use it to activate a logistic neuron with those particular feature values
and weight coefficients, we get back a value of 0.707, which we can interpret as a 70.7 percent
probability that this particular sample $x$ belongs to the positive class. In Chapter 12, Training
Artificial Neural Networks for Image Recognition, we used the one-hot encoding technique to
compute the values in the output layer consisting of multiple logistic activation units. However, as we
will demonstrate with the following code example, an output layer consisting of multiple logistic
activation units does not produce meaningful, interpretable probability values:


# W : array, shape = [n_output_units, n_hidden_units+1]
#          Weight matrix for hidden layer -> output layer.
# note that first column (A[:][0] = 1) are the bias units
>>> W = np.array([[1.1, 1.2, 1.3, 0.5],
...               [0.1, 0.2, 0.4, 0.1],
...               [0.2, 0.5, 2.1, 1.9]])

# A : array, shape = [n_hidden+1, n_samples]
#          Activation of hidden layer.
# note that first element (A[0][0] = 1) is the bias unit
>>> A = np.array([[1.0],
...               [0.1],
...               [0.3],
...               [0.7]])

# Z : array, shape = [n_output_units, n_samples]
#          Net input of the output layer.
>>> Z = W.dot(A)
>>> y_probas = logistic(Z)
>>> print('Probabilities:\n', y_probas)
Probabilities:
 [[ 0.87653295]
 [ 0.57688526]
 [ 0.90114393]]


As we can see in the output, the probability that the particular sample belongs to the first class is
almost 88 percent, the probability that the particular sample belongs to the second class is almost 58
percent, and the probability that the particular sample belongs to the third class is 90 percent,
respectively. This is clearly confusing, since the three values add up to far more than 100 percent and
therefore cannot be read as class membership probabilities. However, this is in fact not a big concern
if we only use our model to predict the class labels, not the class membership probabilities.


>>> y_class = np.argmax(Z, axis=0)
>>> print('predicted class label: %d' % y_class[0])
predicted class label: 2


However, in certain contexts, it can be useful to return meaningful class probabilities for multi-class 
predictions. In the next section, we will take a look at a generalization of the logistic function, the 
softmax function, which can help us with this task. 
Estimating probabilities in multi-class classification via
the softmax function


The softmax function is a generalization of the logistic function that allows us to compute meaningful
class probabilities in multi-class settings (multinomial logistic regression). In softmax, the
probability that a particular sample with net input $z$ belongs to the $i$th class can be computed with a
normalization term in the denominator that is the sum of the exponentials of all $M$ net inputs:

$$P(y = i \mid z) = \phi_{softmax}(z) = \frac{e^{z_i}}{\sum_{m=1}^{M} e^{z_m}}$$


To see softmax in action, let's code it up in Python:


>>> def softmax(z):
...     return np.exp(z) / np.sum(np.exp(z))

>>> def softmax_activation(X, w):
...     z = net_input(X, w)
...     return softmax(z)

>>> y_probas = softmax(Z)
>>> print('Probabilities:\n', y_probas)
Probabilities:
 [[ 0.40386493]
 [ 0.07756222]
 [ 0.51857284]]
>>> y_probas.sum()
1.0


As we can see, the predicted class probabilities now sum up to one, as we would expect. It is also
notable that the probability for the second class is close to zero, since there is a large gap between $z_1$
and $\max(z)$. However, note that the predicted class label is the same as in the logistic function.
Intuitively, it may help to think of the softmax function as a normalized logistic function that is useful
to obtain meaningful class-membership predictions in multi-class settings.


>>> y_class = np.argmax(Z, axis=0)
>>> print('predicted class label: %d' % y_class[0])
predicted class label: 2




Broadening the output spectrum by using a hyperbolic 
tangent 


Another sigmoid function that is often used in the hidden layers of artificial neural networks is the
hyperbolic tangent (tanh), which can be interpreted as a rescaled version of the logistic function:

$$\phi_{tanh}(z) = 2 \times \phi_{logistic}(2z) - 1 = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$

$$\phi_{logistic}(z) = \frac{1}{1 + e^{-z}}$$
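The claim that tanh is a rescaled logistic function is easy to verify numerically. The following short check compares $2 \times \phi_{logistic}(2z) - 1$ with NumPy's built-in tanh on a grid of net inputs; the grid itself is an arbitrary choice for illustration.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

z_grid = np.linspace(-5, 5, 101)

# Rescale the logistic function: stretch the input by 2, the output by 2, shift by -1
rescaled_logistic = 2.0 * logistic(2.0 * z_grid) - 1.0
print(np.allclose(rescaled_logistic, np.tanh(z_grid)))
```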


The advantage of the hyperbolic tangent over the logistic function is that it has a broader output
spectrum and ranges over the open interval (-1, 1), which can improve the convergence of the back
propagation algorithm (C. M. Bishop. Neural networks for pattern recognition. Oxford university
press, 1995, pp. 500-501). In contrast, the logistic function returns an output signal that ranges over the
open interval (0, 1). For an intuitive comparison of the logistic function and the hyperbolic tangent,
let's plot two sigmoid functions in a one-dimensional space:


>>> import matplotlib.pyplot as plt

>>> def tanh(z):
...     e_p = np.exp(z)
...     e_m = np.exp(-z)
...     return (e_p - e_m) / (e_p + e_m)

>>> z = np.arange(-5, 5, 0.005)
>>> log_act = logistic(z)
>>> tanh_act = tanh(z)

>>> plt.ylim([-1.5, 1.5])
>>> plt.xlabel('net input $z$')
>>> plt.ylabel(r'activation $\phi(z)$')
>>> plt.axhline(1, color='black', linestyle='--')
>>> plt.axhline(0.5, color='black', linestyle='--')
>>> plt.axhline(0, color='black', linestyle='--')
>>> plt.axhline(-1, color='black', linestyle='--')

>>> plt.plot(z, tanh_act,
...          linewidth=2,
...          color='black',
...          label='tanh')
>>> plt.plot(z, log_act,
...          linewidth=2,
...          color='lightgreen',
...          label='logistic')

>>> plt.legend(loc='lower right')
>>> plt.tight_layout()
>>> plt.show()


As we can see, the shapes of the two sigmoidal curves look very similar; however, the output range
of the tanh function is twice as wide as that of the logistic function:

[Figure: tanh and logistic activation $\phi(z)$ plotted against the net input $z$]


Note that we implemented the logistic and tanh functions verbosely for the purpose of illustration.
In practice, we can use NumPy's tanh function to achieve the same results:


>>> tanh_act = np.tanh(z)


In addition, the logistic function is available in SciPy's special module:


>>> from scipy.special import expit
>>> log_act = expit(z)


Now that we know more about the different activation functions that are commonly used in artificial
neural networks, let's conclude this section with an overview of the different activation functions that
we encountered in this book.


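Activation function | Equation | Example
Unit step (Heaviside) | $\phi(z) = \begin{cases} 0 & z < 0 \\ 0.5 & z = 0 \\ 1 & z > 0 \end{cases}$ | Perceptron variant
Sign (Signum) | $\phi(z) = \begin{cases} -1 & z < 0 \\ 0 & z = 0 \\ 1 & z > 0 \end{cases}$ | Perceptron variant
Linear | $\phi(z) = z$ | Adaline, linear regression
Piece-wise linear | $\phi(z) = \begin{cases} 1 & z \geq \frac{1}{2} \\ z + \frac{1}{2} & -\frac{1}{2} < z < \frac{1}{2} \\ 0 & z \leq -\frac{1}{2} \end{cases}$ | Support vector machine
Logistic (sigmoid) | $\phi(z) = \frac{1}{1 + e^{-z}}$ | Logistic regression, Multi-layer NN
Hyperbolic tangent | $\phi(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$ | Multi-layer NN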
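As a compact reference, the activation functions from this overview (unit step, sign, linear, piece-wise linear, logistic, and hyperbolic tangent) can be written as NumPy one-liners. This is a sketch for illustration only; the function names are chosen here and the piece-wise linear variant assumes the breakpoints at -1/2 and 1/2.

```python
import numpy as np

def unit_step(z):
    # 0 for z < 0, 0.5 at z == 0, 1 for z > 0
    return np.where(z < 0, 0.0, np.where(z == 0, 0.5, 1.0))

def sign(z):
    # -1 for z < 0, 0 at z == 0, 1 for z > 0
    return np.sign(z)

def linear(z):
    return z

def piecewise_linear(z):
    # z + 1/2 between the breakpoints, clipped to [0, 1] outside
    return np.clip(z + 0.5, 0.0, 1.0)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

z_demo = np.array([-2.0, -0.25, 0.0, 0.25, 2.0])
for activation in (unit_step, sign, linear, piecewise_linear, logistic, tanh):
    print(activation.__name__, activation(z_demo))
```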
Training neural networks efficiently using
Keras


In this section, we will take a look at Keras, one of the most recently developed libraries to facilitate
neural network training. The development of Keras started in the early months of 2015; as of today, it
has evolved into one of the most popular and widely used libraries that are built on top of Theano,
and it allows us to utilize our GPU to accelerate neural network training. One of its prominent features
is its very intuitive API, which allows us to implement neural networks in only a few lines of
code. Once you have Theano installed, you can install Keras from PyPI by executing the following
command from your terminal command line:


pip install Keras
For more information about Keras, please visit the official website at http://keras.io.


To see what neural network training via Keras looks like, let's implement a multilayer perceptron to
classify the handwritten digits from the MNIST dataset, which we introduced in the previous chapter.
The MNIST dataset can be downloaded from http://yann.lecun.com/exdb/mnist/ in four parts as listed
here:


train-images-idx3-ubyte.gz: These are training set images (9912422 bytes)
train-labels-idx1-ubyte.gz: These are training set labels (28881 bytes)
t10k-images-idx3-ubyte.gz: These are test set images (1648877 bytes)
t10k-labels-idx1-ubyte.gz: These are test set labels (4542 bytes)


After downloading and unzipping the archives, we place the files into a directory mnist in our current
working directory, so that we can load the training as well as the test dataset using the following
function:


import os
import struct
import numpy as np


def load_mnist(path, kind='train'):
    """Load MNIST data from `path`"""
    labels_path = os.path.join(path,
                               '%s-labels-idx1-ubyte'
                               % kind)
    images_path = os.path.join(path,
                               '%s-images-idx3-ubyte'
                               % kind)

    with open(labels_path, 'rb') as lbpath:
        magic, n = struct.unpack('>II',
                                 lbpath.read(8))
        labels = np.fromfile(lbpath,
                             dtype=np.uint8)

    with open(images_path, 'rb') as imgpath:
        magic, num, rows, cols = struct.unpack(">IIII",
                                               imgpath.read(16))
        images = np.fromfile(imgpath,
                             dtype=np.uint8).reshape(len(labels), 784)

    return images, labels

X_train, y_train = load_mnist('mnist', kind='train')
print('Rows: %d, columns: %d' % (X_train.shape[0], X_train.shape[1]))
Rows: 60000, columns: 784
X_test, y_test = load_mnist('mnist', kind='t10k')
print('Rows: %d, columns: %d' % (X_test.shape[0], X_test.shape[1]))
Rows: 10000, columns: 784
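To see what the struct.unpack call in load_mnist is doing, the following sketch parses a tiny label file that is built in memory rather than read from disk. The three labels are made up for illustration; the header layout (two big-endian unsigned 32-bit integers, the magic number 2049 followed by the number of items) follows the MNIST label file format.

```python
import struct
import numpy as np

# Build a tiny fake label file: 8-byte header followed by one byte per label
raw = struct.pack('>II', 2049, 3) + bytes([5, 0, 4])

# '>II' reads two big-endian unsigned 32-bit integers from the first 8 bytes
magic, n_items = struct.unpack('>II', raw[:8])

# Everything after the header is the label data, one unsigned byte each
labels_demo = np.frombuffer(raw, dtype=np.uint8, offset=8)
print(magic, n_items, labels_demo)
```

The image files work the same way, except that their header holds four integers (magic number, number of images, rows, and columns), which is why load_mnist reads 16 bytes there.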


On the following pages, we will walk through the code examples for using Keras step by step, which
you can directly execute from your Python interpreter. However, if you are interested in training the
neural network on your GPU, you can either put it into a Python script, or download the respective
code from the Packt Publishing website. In order to run the Python script on your GPU, execute the
following command from the directory where the mnist_keras_mlp.py file is located:


THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python
mnist_keras_mlp.py


To continue with the preparation of the training data, let's cast the MNIST image array into 32-bit 
format: 


>>> import theano 

>>> theano.config.floatX = 'float32' 

>>> X_train = X_train.astype(theano.config.floatX)
>>> X_test = X_test.astype(theano.config.floatX)


Next, we need to convert the class labels (integers 0-9) into the one-hot format. Fortunately, Keras 
provides a convenient tool for this: 


>>> from keras.utils import np_utils


>>> print('First 3 labels: ', y_train[:3])
First 3 labels:  [5 0 4]
>>> y_train_ohe = np_utils.to_categorical(y_train)


>>> print('\nFirst 3 labels (one-hot):\n', y_train_ohe[:3])
First 3 labels (one-hot):
 [[ 0.  0.  0.  0.  0.  1.  0.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  1.  0.  0.  0.  0.  0.]]

Now, we can get to the interesting part and implement a neural network. Here, we will use the same 
architecture as in Chapter 12, Training Artificial Neural Networks for Image Recognition. However, 
we will replace the logistic units in the hidden layer with hyperbolic tangent activation functions, 
replace the logistic function in the output layer with softmax, and add an additional hidden layer. 


Keras makes these tasks very simple, as you can see in the following code implementation: 


>>> from keras.models import Sequential 
>>> from keras.layers.core import Dense 
>>> from keras.optimizers import SGD 


>>> np.random.seed(1)


>>> model = Sequential()

>>> model.add(Dense(input_dim=X_train.shape[1],
...                 output_dim=50,
...                 init='uniform',
...                 activation='tanh'))


>>> model.add(Dense(input_dim=50,
...                 output_dim=50,
...                 init='uniform',
...                 activation='tanh'))


>>> model.add(Dense(input_dim=50,
...                 output_dim=y_train_ohe.shape[1],
...                 init='uniform',
...                 activation='softmax'))


>>> sgd = SGD(lr=0.001, decay=1e-7, momentum=.9)
>>> model.compile(loss='categorical_crossentropy', optimizer=sgd)


First, we initialize a new model using the Sequential class to implement a feedforward neural
network. Then, we can add as many layers to it as we like. However, since the first layer that we add
is the input layer, we have to make sure that the input_dim attribute matches the number of features
(columns) in the training set (here, 784). Also, we have to make sure that the number of output units
(output_dim) and input units (input_dim) of two consecutive layers match. In the preceding
example, we added two hidden layers with 50 hidden units plus 1 bias unit each. Note that bias units
are initialized to 0 in fully connected networks in Keras. This is in contrast to the MLP
implementation in Chapter 12, Training Artificial Neural Networks for Image Recognition, where
we initialized the bias units to 1, which is a more common (not necessarily better) convention.
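To get a feeling for the size of this network, we can count its trainable parameters by hand. The sketch below assumes the architecture described above (784 input features, two hidden layers with 50 units each, and 10 output classes) and one bias weight per unit.

```python
layer_sizes = [784, 50, 50, 10]   # input features, two hidden layers, output classes

params_per_layer = [
    (n_in + 1) * n_out             # +1 accounts for the bias unit feeding each layer
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(params_per_layer)
print(params_per_layer, total_params)
```

Most of the weights sit in the first layer, since every one of the 784 pixels is connected to every hidden unit.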


Finally, the number of units in the output layer should be equal to the number of unique class labels,
that is, the number of columns in the one-hot encoded class label array. Before we can compile our model,
we also have to define an optimizer. In the preceding example, we chose a stochastic gradient descent
optimization, which we are already familiar with from previous chapters. Furthermore, we can set
values for the weight decay constant and momentum learning to adjust the learning rate at each epoch
as discussed in Chapter 12, Training Artificial Neural Networks for Image Recognition. Lastly, we
set the cost (or loss) function to categorical_crossentropy. The (binary) cross-entropy is just the
technical term for the cost function in logistic regression, and the categorical cross-entropy is its
generalization for multi-class predictions via softmax. After compiling the model, we can now train it
by calling the fit method. Here, we are using mini-batch stochastic gradient descent with a batch size
of 300 training samples per batch. We train the MLP over 50 epochs, and we can follow the optimization of
the cost function during training by setting verbose=1. The validation_split parameter is
especially handy, since it will reserve 10 percent of the training data (here, 6,000 samples) for
validation after each epoch, so that we can check if the model is overfitting during training.


>>> model.fit(X_train,
...           y_train_ohe,
...           nb_epoch=50,
...           batch_size=300,
...           verbose=1,
...           validation_split=0.1,
...           show_accuracy=True)
Train on 54000 samples, validate on 6000 samples
Epoch 0
54000/54000 [==============================] - 1s - loss: 2.2290 - acc: 0.3592 -
val_loss: 2.1094 - val_acc: 0.5342
Epoch 1
54000/54000 [==============================] - 1s - loss: 1.8850 - acc: 0.5279 -
val_loss: 1.6098 - val_acc: 0.5617
Epoch 2
54000/54000 [==============================] - 1s - loss: 1.3903 - acc: 0.5884 -
val_loss: 1.1666 - val_acc: 0.6707
Epoch 3
54000/54000 [==============================] - 1s - loss: 1.0592 - acc: 0.6936 -
val_loss: 0.8961 - val_acc: 0.7615
[...]
Epoch 49
54000/54000 [==============================] - 1s - loss: 0.1907 - acc: 0.9432 -
val_loss: 0.1749 - val_acc: 0.9482

Printing the value of the cost function is extremely useful during training, since we can quickly spot
whether the cost is decreasing and, if it is not, stop the algorithm early to tune the
hyperparameter values.
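The loss value that is printed during training is the categorical cross-entropy between the one-hot encoded labels and the softmax outputs. As a rough sketch of the underlying formula, it can be computed in NumPy as follows; averaging over the samples and clipping with a small eps are assumptions made for this illustration, and the two example predictions are made up.

```python
import numpy as np

def categorical_crossentropy(y_true_ohe, y_prob, eps=1e-12):
    """Mean over samples of -sum_k y_k * log(p_k)."""
    y_prob = np.clip(y_prob, eps, 1.0)   # avoid log(0)
    return float(-np.mean(np.sum(y_true_ohe * np.log(y_prob), axis=1)))

# Two hypothetical samples with three classes each
y_true_demo = np.array([[0.0, 0.0, 1.0],
                        [1.0, 0.0, 0.0]])
y_prob_demo = np.array([[0.1, 0.1, 0.8],
                        [0.5, 0.3, 0.2]])

loss_demo = categorical_crossentropy(y_true_demo, y_prob_demo)
print(loss_demo)
```

Only the probability assigned to the true class enters the sum, so a confident correct prediction contributes a value close to zero, while a confident wrong prediction is penalized heavily.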


To predict the class labels, we can then use the predict_classes method to return the class labels
directly as integers:


>>> y_train_pred = model.predict_classes(X_train, verbose=0)
>>> print('First 3 predictions: ', y_train_pred[:3])
First 3 predictions:  [5 0 4]


Finally, let's print the model accuracy on training and test sets: 


>>> train_acc = np.sum(
...       y_train == y_train_pred, axis=0) / X_train.shape[0]
>>> print('Training accuracy: %.2f%%' % (train_acc * 100))
Training accuracy: 94.51%


>>> y_test_pred = model.predict_classes(X_test, verbose=0)
>>> test_acc = np.sum(y_test == y_test_pred,
...                   axis=0) / X_test.shape[0]
>>> print('Test accuracy: %.2f%%' % (test_acc * 100))
Test accuracy: 94.39%


Note that this is just a very simple neural network without optimized tuning parameters. If you are
interested in playing more with Keras, please feel free to further tweak the learning rate, momentum,
weight decay, and number of hidden units.


Note 


Although Keras is a great library for implementing and experimenting with neural networks, there are
many other Theano wrapper libraries that are worth mentioning. A prominent example is Pylearn2
(http://deeplearning.net/software/pylearn2/), which has been developed in the LISA lab in Montreal.
Also, Lasagne (https://github.com/Lasagne/Lasagne) may be of interest to you if you prefer a more
minimalistic but extensible library that offers more control over the underlying Theano code.
Summary


I hope you enjoyed this last chapter of an exciting tour of machine learning. Throughout this book, we 
covered all of the essential topics that this field has to offer, and you should now be well equipped to 
put those techniques into action to solve real-world problems. 


We started our journey with a brief overview of the different types of learning tasks: supervised 
learning, reinforcement learning, and unsupervised learning. We discussed several different learning 
algorithms that can be used for classification, starting with simple single-layer neural networks in 
Chapter 2, Training Machine Learning Algorithms for Classification. Then, we discussed more 
advanced classification algorithms in Chapter 3, A Tour of Machine Learning Classifiers Using 
Scikit-learn, and you learned about the most important aspects of a machine learning pipeline in 
Chapter 4, Building Good Training Sets — Data Preprocessing and Chapter 5, Compressing Data 
via Dimensionality Reduction. Remember that even the most advanced algorithm is limited by the
information in the training data that it gets to learn from. In Chapter 6, Learning Best Practices for 
Model Evaluation and Hyperparameter Tuning, you learned about the best practices to build and 
evaluate predictive models, which is another important aspect in machine learning applications. If 
one single learning algorithm does not achieve the performance we desire, it can sometimes be 
helpful to create an ensemble of experts to make a prediction. We discussed this in Chapter 7, 
Combining Different Models for Ensemble Learning. In Chapter 8, Applying Machine Learning to 
Sentiment Analysis, we applied machine learning to analyze what is probably the most interesting form
of data in a modern age dominated by social media platforms on the Internet: text documents.
However, machine learning techniques are not limited to offline data analysis, and in Chapter 9, 
Embedding a Machine Learning Model into a Web Application, we saw how to embed a machine 
learning model into a web application to share it with the outside world. For the most part, our focus 
was on algorithms for classification, probably the most popular application of machine learning. 
However, this is not where it ends! In Chapter 10, Predicting Continuous Target Variables with
Regression Analysis, we explored several algorithms for regression analysis to predict continuous- 
valued output values. Another exciting subfield of machine learning is clustering analysis, which can 
help us to find hidden structures in data even if our training data does not come with the right answers 
to learn from. We discussed this in Chapter 11, Working with Unlabeled Data — Clustering Analysis. 


In the last two chapters of this book, we caught a glimpse of the most beautiful and most exciting 
algorithms in the whole machine learning field: artificial neural networks. Although deep learning 
really is beyond the scope of this book, I hope I could at least kindle your interest to follow the most 
recent advancements in this field. If you are considering a career as a machine learning researcher, or
even if you just want to keep up to date with the current advancements in this field, I recommend
that you follow the work of the leading experts in this field, such as Geoff Hinton
(http://www.cs.toronto.edu/~hinton/), Andrew Ng (http://www.andrewng.org), Yann LeCun
(http://yann.lecun.com), Juergen Schmidhuber (http://people.idsia.ch/~juergen/), and Yoshua Bengio
(http://www.iro.umontreal.ca/~bengioy), just to name a few. Also, please do not hesitate to join the
scikit-learn, Theano, and Keras mailing lists to participate in interesting discussions around these
libraries, and machine learning in general. I am looking forward to meeting you there! You are always
welcome to contact me if you have any questions about this book or need some general tips about
machine learning.


I hope this journey through the different aspects of machine learning was really worthwhile, and you 
learned many new and useful skills to advance your career and apply them to real-world problem 
solving. 
Part 2. Module 2


Designing Machine Learning Systems with Python 


Leverage benefits of machine learning techniques using Python 
Chapter 1. Thinking in Machine Learning


Machine learning systems have a profound and exciting ability to provide important insights to an
amazing variety of applications, from groundbreaking and life-saving medical research to
discovering fundamental physical aspects of our universe, and from providing us with better, cleaner
food to web analytics and economic modeling. In fact, there are hardly any areas of our lives that
have not been touched by this technology in some way. With an expanding Internet of Things, there is
a staggering amount of data being generated, and it is clear that intelligent systems are changing
societies in quite dramatic ways. With open source tools, such as those provided by Python and its
libraries, and the increasing open source knowledge base represented by the Web, it is relatively easy
and cheap to learn and apply this technology in new and exciting ways. In this chapter, we will cover
the following topics:


Human interface 

Design principles 

Models 

Unified modelling language 
The human interface


For those of you old enough, or unfortunate enough, to have used early versions of the Microsoft
Office suite, you will probably remember the Mr Clippy office assistant. This feature, first introduced
in Office 97, popped up uninvited from the bottom right-hand side of your computer screen every time 
you typed the word 'Dear' at the beginning of a document, with the prompt "it looks like you are 
writing a letter, would you like help with that?". 


Mr Clippy, turned on by default in early versions of Office, was almost universally derided by users 
of the software and could go down in history as one of machine learning's first big fails. 


So, why was the cheery Mr Clippy so hated? Clearly the folks at Microsoft, at the forefront of 
consumer software development, were not stupid, and the idea that an automated assistant could help 
with day to day office tasks is not necessarily a bad idea. Indeed, later incarnations of automated 
assistants, the best ones at least, operate seamlessly in the background and provide a demonstrable 
increase in work efficiency. Consider predictive text. There are many examples, some very funny, of
where predictive text has gone spectacularly wrong, but in the majority of cases where it doesn't fail, 
it goes unnoticed. It just becomes part of our normal work flow. 


At this point, we need a distinction between error and failure. Mr Clippy failed because it was
obtrusive and poorly designed, not necessarily because it was in error; that is, it could make the right
suggestion, but chances are you already knew that you were writing a letter. Predictive text has a high
error rate, that is, it often gets the prediction wrong, but it does not fail, largely because of the way
it is designed to fail: unobtrusively.


The design of any system that has a tightly coupled human interface, to use systems engineering 
speak, is difficult. Human behavior, like the natural world in general, is not something we can always 
predict. Expression recognition systems, natural language processing, and gesture recognition 
technology, amongst other things, all open up new ways of human-machine interaction, and this has 
important applications for the machine learning specialist. 


Whenever we are designing a system that requires human input, we need to anticipate the possible 
ways, not just the intended ways, a human will interact with the system. In essence, what we are 
trying to do with these systems is to instil in them some understanding of the broad panorama of
human experience. 


In the first years of the web, search engines used a simple system based on the number of times search 
terms appeared in articles. Web developers soon began gaming the system by increasing the number 
of key search terms. Clearly, this would lead to a keyword arms race and result in a very boring web. 
The page rank system measuring the number of quality inbound links was designed to provide a more 
accurate search result. Now, of course, modern search engines use more sophisticated and secret 
algorithms. 
What is also important for ML designers is the ever increasing amount of data that is being generated.
This presents several challenges, most notably its sheer vastness. However, the power of algorithms
in extracting knowledge and insights that would not have been possible with smaller data sets is
massive. So many human interactions are now digitized, and we are only just beginning to understand
and explore the many ways in which this data can be used.

As a curious example, consider the study The Expression of Emotions in 20th Century Books (Acerbi
et al., 2013). Though strictly more of a data analysis study than a machine learning one, it is
illustrative for several reasons. Its purpose was to chart the emotional content, in terms of a mood
score, of text extracted from books of the 20th century. With access to a large volume of digitized text
through the Project Gutenberg digital library, WordNet (http://wordnet.princeton.edu/wordnet/), and
Google's Ngram database (books.google.com/ngrams), the authors of this study were able to map
cultural change over the 20th century as reflected in the literature of the time. They did this by
mapping trends in the usage of mood words.


For this study, the authors labeled each word (a 1-gram) and associated it with a mood score and the
year it was published. We can see that emotion words, such as joy, sadness, fear, and so forth, can be 
scored according to the positive or negative mood they evoke. The mood score was obtained from 
WordNet (wordnet.princeton.edu). WordNet assigns an affect score to each mood word. Finally, the 
authors simply counted the occurrences of each mood word: 





$$M = \frac{1}{n}\sum_{i=1}^{n}\frac{c_i}{C_{the}}, \qquad M_z = \frac{M - \mu_M}{\sigma_M}$$


Here, $c_i$ is the count of a particular mood word, $n$ is the total count of mood words (not all
words, just words with a mood score), and $C_{the}$ is the count of the word the in the text. This
normalizes the sum to take into account that in some years more books were written (or digitized).
Also, since many later books tend to contain more technical language, the count of the word the was
used to normalize rather than the total word count. This gives a more accurate representation of
emotion over a long time period in prose text. Finally, the score is standardized to $M_z$ by
subtracting the mean $\mu_M$ and dividing by the standard deviation $\sigma_M$.
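
To make the scoring concrete, here is a minimal sketch of the calculation described above: each mood
word count is divided by the count of the word the, the ratios are averaged, and the yearly scores are
then standardized. The function names, the made-up counts, and the use of the population standard
deviation are assumptions for illustration; they are not taken from the study's own code.

```python
from statistics import mean, pstdev


def mood_score(mood_counts, the_count):
    """Average of c_i / C_the over the n mood words for one year of text."""
    n = len(mood_counts)
    return sum(c / the_count for c in mood_counts) / n


def z_scores(scores):
    """Standardize scores: subtract the mean, divide by the standard deviation."""
    mu = mean(scores)
    sigma = pstdev(scores)  # assumption: population standard deviation
    return [(m - mu) / sigma for m in scores]


# Made-up data for three years: (counts of each mood word, count of "the").
yearly = [([120, 80, 40], 10000), ([90, 60, 30], 9000), ([200, 100, 60], 12000)]
raw = [mood_score(counts, the) for counts, the in yearly]
standardized = z_scores(raw)
print(raw)
print(standardized)
```

Dividing by the count of the word the, rather than by the number of books, is what makes years with
very different amounts of digitized text comparable.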
This figure is taken from The Expression of Emotions in 20th Century Books (Alberto Acerbi,
Vasileios Lampos, Phillip Garnett, R. Alexander Bentley), PLOS.


Here we can see one of the graphs generated by this study. It shows the joy-sadness score for books 
written in this period, and clearly shows a negative trend associated with the period of World War II.


This study is interesting for several reasons. Firstly, it is an example of data-driven science, where 
previously considered soft sciences, such as sociology and anthropology, are given a solid empirical 
footing. Despite some pretty impressive results, this study was relatively easy to implement. This is 
mainly because most of the hard work had already been done by WordNet and Google. This highlights
how, using data resources that are freely available on the Internet and software tools such as
Python's data and machine learning packages, anyone with the data skills and motivation can build on
this work.
Design principles


An analogy is often made between systems design and designing other things such as a house. To a 
certain extent, this analogy holds true. We are attempting to place design components into a structure 
that meets a specification. The analogy breaks down when we consider their respective operating 
environments. It is generally assumed in the design of a house that the landscape, once suitably 
formed, will not change. 


Software environments are slightly different. Systems are interactive and dynamic. Any system that
we design will be nested inside other systems, either electronic, physical, or human. In the same way
that different layers in computer networks (application layer, transport layer, physical layer, and so
on) nest different sets of meanings and functions, so too do activities performed at different levels
of a project.


As the designers of these systems, we must also have a strong awareness of the setting, that is, the
domain in which we work. This knowledge gives us clues to patterns in our data and helps us give
context to our work.


Machine learning projects can be divided into six distinct activities, shown as follows:


Defining the object and specification 
Preparing and exploring the data 
Model building 

Implementation 

Testing 

Deployment 


The designer is mainly concerned with the first three. However, they often play, and in many projects 
must play, a major role in other activities. It should also be said that a project's timeline is not 
necessarily a linear sequence of these activities. The important point is that they are distinct 
activities. They may occur in parallel to each other, and in other ways interact with each other, but 
they generally involve different types of tasks that can be separated in terms of human and other 
resources, the stage of the project, and externalities. Also, we need to consider that different 
activities involve distinct operational modes. Consider the different ways in which your brain works 
when you are sketching out an idea, as compared to when you are working on a specific analytical 
task, say a piece of code. 


Often, the hardest question is where to begin. We can start drilling into the different elements of a 
problem, with an idea of a feature set and perhaps an idea of the model or models we might use. This 
may lead to a defined object and specification, or we may have to do some preliminary research such 
as checking possible data sets and sources, available technologies, or talking to other engineers, 
technicians, and users of the system. We need to explore the operating environment and the various
constraints; is it part of a web application, or is it a laboratory research tool for scientists?
In the early stages of design, our workflow will flip between working on the different elements. For
instance, we start with a general problem (perhaps having an idea of the task, or tasks, necessary to
solve it), then we divide it into what we think are the key features, try it out on a few models with a
toy dataset, go back to refine the feature set, adjust our model, precisely define tasks, and refine the
model. When we feel our system is robust enough, we can test it out on some real data. Of course,
then we may need to go back and change our feature set.


Selecting and optimizing features is often a major activity (really, a task in itself) for the machine 
learning designer. We cannot really decide what features we need until we have adequately described 
the task, and of course, both the task and features are constrained by the types of feasible models we 
can build. 
Types of questions


As designers, we are asked to solve a problem. We are given some data and an expected output. The
first step is to frame the problem in a way that a machine can understand, and in a way that carries
meaning for a human. We can take the following six broad approaches to precisely define our machine
learning problem:


Exploratory: Here, we analyze data, looking for patterns such as a trend or relationship
between variables. Exploration will often lead to a hypothesis such as linking diet with disease,
or crime rate with urban dwellings.

Descriptive: Here, we try to summarize specific features of our data. For instance, the average
life expectancy, average temperature, or the number of left-handed people in a population.

Inferential: An inferential question is one that attempts to support a hypothesis, for instance,
proving (or disproving) a general link between life expectancy and income by using different
data sets.

Predictive: Here, we are trying to anticipate future behavior. For instance, predicting life
expectancy by analyzing income.

Causal: This is an attempt to find out what causes something. Does low income cause a lower
life expectancy?

Mechanistic: This tries to answer questions such as "what are the mechanisms that link income
with life expectancy?"


Most machine learning problems involve several of these types of questions during development. For 
instance, we may first explore the data looking for patterns or trends, and then we may describe 
certain key features of our data. This may enable us to make a prediction, and find a cause or a 
mechanism behind a particular problem. 


Are you asking the right question? 


The question must be plausible and meaningful in its subject area. This domain knowledge enables 
you to understand the things that are important in your data and to see where a certain pattern or 
correlation has meaning. 


The question should be as specific as possible, while still giving a meaningful answer. It is common
for it to begin as a generalized statement, such as "I wonder if wealthy means healthy". So, you do
some further research and find you can get statistics for wealth by geographic region, say from the tax
office. We can measure health through its inverse, that is, illness, say by hospital admissions, and we
can test our initial proposition, "wealthy means healthy", by tying illness to geographic region. We
can see that a more specific question relies on several, perhaps questionable, assumptions.


We should also consider that our results may be confounded by the fact that poorer people may not
have healthcare insurance, so are less likely to go to a hospital despite illness. There is an interaction
between what we want to find out and what we are trying to measure. This interaction perhaps hides
the true rate of illness. All is not lost, however. Because we know about these things, perhaps we
can account for them in our model.


We can make things a lot easier by learning as much as we can about the domain we are working in. 


You could possibly save yourself a lot of time by checking whether the question you are asking, or 
part of it, has already been answered, or if there are data sets available that may shed some light on 
that topic. Often, you have to approach a problem from several different angles at once. Do as much 
preparatory research as you can. It is quite likely that other designers have done work that could shed 
light on your own. 
Tasks


A task is a specific activity conducted over a period of time. We have to distinguish between the
human tasks (planning, designing, and implementing) and the machine tasks (classification,
clustering, regression, and so on). Also consider cases where human and machine tasks overlap, for
example, in selecting features for a model. Our true goal in machine learning is to transform as
many of these tasks as we can from human tasks to machine tasks.


It is not always easy to match a real world problem to a specific task. Many real world problems may 
seem to be conceptually linked but require a very different solution. Alternatively, problems that 
appear completely different may require similar methods. Unfortunately, there is no simple lookup 
table to match a particular task to a problem. A lot depends on the setting and domain. A similar 
problem in one domain may be unsolvable in another, perhaps because of lack of data. There are, 
however, a small number of tasks that are applied to a large number of methods to solve many of the 
most common problem types. In other words, in the space of all possible programming tasks, there is 
a subset of tasks that are useful to our particular problem. Within this subset, there is a smaller subset 
of tasks that are easy and can actually be applied usefully to our problem. 


Machine learning tasks occur in three broad settings: 


• Supervised learning: The goal here is to learn a model from labeled training data that allows
predictions to be made on unseen future data.

• Unsupervised learning: Here we deal with unlabeled data and our goal is to find hidden
patterns in this data to extract meaningful information.

• Reinforcement learning: The goal here is to develop a system that improves its performance
based on the interactions it has with its environment. This usually involves a reward signal. This
is similar to supervised learning, except that rather than having a labeled training set,
reinforcement learning uses a reward function to continually improve its performance.


Now, let's take a look at some of the major machine learning tasks. The following diagram should
give you a starting point to try and decide what type of task is appropriate for different machine
learning problems:
Classification


Classification is probably the most common type of task; this is due in part to the fact that it is
relatively easy, well understood, and solves a lot of common problems. Classification is about
assigning classes to a set of instances, based on their features. This is a supervised learning method
because it relies on a labeled training set to learn a set of model parameters. This model can then be
applied to unlabeled data to make a prediction on what class each instance belongs to. There are
broadly two types of classification tasks: binary classification and multiclass classification. A
typical binary classification task is e-mail spam detection. Here we use the contents of an e-mail to
determine if it belongs to one of the two classes: spam or not spam. An example of multiclass
classification is handwriting recognition, where we try to predict a class, for example, the letter
name. In this case, we have one class for each of the alphanumeric characters. Multiclass
classification can sometimes be achieved by chaining binary classification tasks together; however,
we lose information this way, and we are unable to define a single decision boundary. For this
reason, multiclass classification is often treated separately from binary classification.
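
One common way of combining binary classifiers into a multiclass one is the one-vs-rest scheme: one
binary scorer per class, with the most confident scorer deciding the label. The sketch below assumes
this scheme; the function name one_vs_rest_predict and the hand-written scorers are hypothetical and
involve no training.

```python
def one_vs_rest_predict(binary_scorers, x):
    """Pick the class whose binary scorer is most confident for x.

    `binary_scorers` maps a class label to a function returning a score,
    where a higher score means "more likely this class than the rest".
    """
    return max(binary_scorers, key=lambda label: binary_scorers[label](x))


# Hypothetical hand-written scorers on a single feature (no training involved).
scorers = {
    'small': lambda x: -abs(x - 1.0),
    'medium': lambda x: -abs(x - 5.0),
    'large': lambda x: -abs(x - 9.0),
}

print(one_vs_rest_predict(scorers, 4.2))
```

Each scorer only knows about its own class, which illustrates the point made above: no single decision
boundary is defined, and information about how the other classes relate to each other is lost.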


Regression 


There are cases where what we are interested in are not discrete classes, but a continuous variable,
for instance, a probability. These types of problems are regression problems. The aim of regression
analysis is to understand how changes to the input (independent) variables affect the dependent
variable. The simplest regression problems are linear and involve fitting a straight line to
a set of data in order to make a prediction. This is usually done by minimizing the sum of squared
errors over the instances in the training set. Typical regression problems include estimating the
likelihood of a disease given a range and severity of symptoms, or predicting test scores given past
performance.
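
As a minimal sketch of fitting a straight line by minimizing the sum of squared errors, the following
uses the standard closed-form solution for a single input variable. The function names and the data
values are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared errors."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def sum_squared_errors(xs, ys, slope, intercept):
    """Sum of squared differences between observed and predicted values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Made-up data: past performance (x) and test score (y).
past = [1.0, 2.0, 3.0, 4.0]
score = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(past, score)
print(slope, intercept, sum_squared_errors(past, score, slope, intercept))
```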


Clustering 


Clustering is the best known unsupervised method. Here, we are concerned with making a
measurement of similarity between instances in an unlabeled dataset. We often use geometric models
to determine the distance between instances, based on their feature values. We can use an arbitrary
measurement of closeness to determine what cluster each instance belongs to. Clustering is often used
in data mining and exploratory data analysis. There is a large variety of methods and algorithms that
perform this task; some of the approaches include distance-based methods, finding a
center point for each cluster, or using statistical techniques based on distributions.
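
The distance-based, center-point approach can be sketched as follows: each instance is assigned to its
nearest center, and each center is then moved to the mean of its members. This is a single step in the
style of k-means, written with assumed function names and made-up points; a complete algorithm would
repeat the two steps until the assignments stop changing.

```python
import math


def assign_clusters(points, centers):
    """Assign each point to the index of its nearest center (Euclidean)."""
    return [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points]


def update_centers(points, labels, k):
    """Move each center to the mean of the points assigned to it."""
    centers = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        # Assumes every cluster has at least one member.
        centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return centers


# Made-up two-dimensional instances and initial center guesses.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centers = [(0.0, 0.0), (10.0, 10.0)]
labels = assign_clusters(points, centers)
centers = update_centers(points, labels, k=2)
print(labels, centers)
```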


Related to clustering is association; this is an unsupervised task to find a certain type of pattern in
the data. This task is behind product recommender systems such as those provided by Amazon and
other online shops.


Dimensionality reduction 


Many data sets contain a large number of features or measurements associated with each instance. 
This can present a challenge in terms of computational power and memory allocation. Also many 
features may contain redundant information or information that is correlated to other features. In these 
cases, the performance of our learning model may be significantly degraded. Dimensionality 
reduction is most often used in feature preprocessing; it compresses the data into a lower-dimensional
subspace while retaining useful information. Dimensionality reduction is also used when we want to
visualize data, typically by projecting higher dimensions onto one, two, or three dimensions. 


From these basic machine tasks, there are a number of derived tasks. In many applications, this may
simply be applying the learning model to a prediction to establish a causal relationship. We must
remember that explaining and predicting are not the same. A model can make a prediction, but unless 
we know explicitly how it made the prediction, we cannot begin to form a comprehensible 
explanation. An explanation requires human knowledge of the domain. 


We can also use a prediction model to find exceptions from a general pattern. Here we are interested 
in the individual cases that deviate from the predictions. This is often called anomaly detection and 
has wide applications in things like detecting bank fraud, noise filtering, and even in the search for 
extraterrestrial life. 
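
A simple way to picture this is to flag the cases whose prediction error (residual) is unusually large
compared to the others. The sketch below assumes a threshold of two standard deviations; the function
name find_anomalies, the threshold, and the values are illustrative choices, not a prescribed method.

```python
from statistics import mean, pstdev


def find_anomalies(actual, predicted, threshold=2.0):
    """Return indices whose residual lies more than `threshold` standard
    deviations away from the mean residual."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    mu = mean(residuals)
    sigma = pstdev(residuals)
    return [i for i, r in enumerate(residuals)
            if abs(r - mu) > threshold * sigma]


# Made-up values: the model predicts 10.0 everywhere; one observation deviates.
predicted = [10.0] * 8
actual = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 25.0, 10.1]
print(find_anomalies(actual, predicted))
```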


An important and potentially useful task is subgroup discovery. Our goal here is not, as in clustering,
to partition the entire domain, but rather to find a subgroup that has a substantially different
distribution. In essence, subgroup discovery is trying to find relationships between a dependent target
variable and many independent explaining variables. We are not trying to find a complete
relationship, but rather a group of instances that are different in ways that are important to the domain.
For instance, establishing a subgroup, smoker = true and family history = true, for a target variable
of heart disease = true.
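
The smoker and family history example can be sketched by comparing the rate of the target variable
inside a candidate subgroup with the rate over all instances. The records below are made up purely for
illustration, and the helper names are hypothetical.

```python
def positive_rate(records, target):
    """Fraction of records for which the target variable is true."""
    return sum(1 for r in records if r[target]) / len(records)


def subgroup(records, **conditions):
    """Select the records matching every key=value condition."""
    return [r for r in records
            if all(r[key] == value for key, value in conditions.items())]


# Made-up records for illustration only.
records = [
    {'smoker': True, 'family_history': True, 'heart_disease': True},
    {'smoker': True, 'family_history': True, 'heart_disease': True},
    {'smoker': True, 'family_history': False, 'heart_disease': False},
    {'smoker': False, 'family_history': True, 'heart_disease': False},
    {'smoker': False, 'family_history': False, 'heart_disease': False},
    {'smoker': False, 'family_history': False, 'heart_disease': True},
]
overall = positive_rate(records, 'heart_disease')
group = subgroup(records, smoker=True, family_history=True)
in_group = positive_rate(group, 'heart_disease')
print(overall, in_group)
```

A subgroup is interesting when its rate differs substantially from the overall rate; a real analysis
would also check that the subgroup is large enough for the difference to be meaningful.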


Finally, we consider control type tasks. These act to optimize control settings to maximize a payoff, 
given different conditions. This can be achieved in several ways. We can clone expert behavior: the 
machine learns directly from a human and makes predictions on actions given different conditions. 
The task is to learn a prediction model for the expert's actions. This is similar to reinforcement
learning, where the task is to learn a relationship between conditions and optimal action. 


Errors 


In machine learning systems, software flaws can have very serious real world consequences; what 
happens if your algorithm, embedded in an assembly line robot, classifies a human as a production 
component? Clearly, in critical systems, you need to plan for failure. There should be a robust fault 
and error detection procedure embedded in your design process and systems. 


Sometimes it is necessary to design very complex systems simply for the purpose of debugging and
checking for logic flaws. It may be necessary to generate data sets with specific statistical structures, 
or create artificial humans to mimic an interface. For example, developing a methodology to verify 
that the logic of your design is sound at the data, model, and task levels. Errors can be hard to track, 
and as a scientist, you must assume that there are errors and try to prove otherwise. 


The idea of recognizing and gracefully catching errors is important for the software designer, but as 
machine learning systems designers, we must take it a step further. We need to be able to capture, in 
our models, the ability to learn from an error. 


Consideration must be given to how we select our test set, and in particular, how representative it is
of the rest of the dataset. For instance, if it is noisy compared to the training set, the model will give
poor results on the test set, suggesting that our model is overfitting when, in fact, this is not the case. To
avoid this, a process of cross validation is used. This works by randomly dividing the data into, for 
example, ten chunks of equal size. We use nine chunks for training the model and one for testing. We 
do this 10 times, using each chunk once for testing. Finally, we take an average of test set 
performance. Cross validation is used with other supervised learning problems besides 
classification, but as you would expect, unsupervised learning problems need to be evaluated 
differently. 
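
The cross validation procedure described above can be sketched with plain index lists. The function
names, the fixed random seed, and the evaluate callback (which would train a model on the training
indices and return its score on the test indices) are assumptions made for this example.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each chunk is the test set once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k chunks of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_val_score(evaluate, n_samples, k=10):
    """Average the scores returned by evaluate(train_idx, test_idx)."""
    scores = [evaluate(tr, te) for tr, te in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


splits = list(k_fold_indices(100, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```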


With an unsupervised task we do not have a labeled training set. Evaluation can therefore be a little 
tricky since we do not know what a correct answer looks like. In a clustering problem, for instance, 
we can compare the quality of different models by measures such as the ratio of cluster diameter 
compared to the distance between clusters. However, in problems of any complexity, we can never 
tell if there is another model, not yet built, that is better.
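
One possible reading of the diameter-to-distance measure is sketched below for two clusters: the larger
of the two cluster diameters divided by the distance between the cluster centroids. The exact
definition used here, the function names, and the points are assumptions for illustration; other
definitions of diameter and separation are equally valid.

```python
import math
from itertools import combinations


def diameter(cluster):
    """Largest distance between any two points in a cluster."""
    return max((math.dist(a, b) for a, b in combinations(cluster, 2)),
               default=0.0)


def centroid(cluster):
    """Mean position of the points in a cluster."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))


def diameter_to_separation(cluster_a, cluster_b):
    """Ratio of the larger cluster diameter to the distance between centroids.

    Smaller values suggest tighter, better separated clusters.
    """
    separation = math.dist(centroid(cluster_a), centroid(cluster_b))
    return max(diameter(cluster_a), diameter(cluster_b)) / separation


tight = diameter_to_separation([(0, 0), (0, 1)], [(10, 0), (10, 1)])
loose = diameter_to_separation([(0, 0), (0, 6)], [(10, 0), (10, 6)])
print(tight, loose)
```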
Optimization


Optimization problems are ubiquitous in many different domains, such as finance, business, 
management, sciences, mathematics, and engineering. Optimization problems consist of the following: 


• An objective function that we want to maximize or minimize.

• Decision variables, that is, a set of controllable inputs. These inputs are varied within the
specified constraints in order to satisfy the objective function.

• Parameters, which are uncontrollable or fixed inputs.

• Constraints, which are relations between decision variables and parameters. They define what
values the decision variables can have.


Most optimization problems have a single objective function. In the cases where we may have 
multiple objective functions, we often find that they conflict with each other, for example, reducing 
costs and increasing output. In practice, we try to reformulate multiple objectives into a single 
function, perhaps by creating a weighted combination of objective functions. In our costs and output 
example, a variable along the lines of cost per unit might work. 


The decision variables are the variables we control to achieve the objective. They may include things 
such as resources or labor. The parameters of the model are fixed for each run of the model. We may
use several cases, where we choose different parameters to test variations in multiple conditions.


There are literally thousands of solution algorithms to the many different types of optimization 
problems. Most of them involve first finding a feasible solution, then iteratively improving on it by 
adjusting the decision variables to hopefully find an optimum solution. Many optimization problems 
can be solved reasonably well with linear programming techniques. They assume that the objective 
function and all the constraints are linear with respect to the decision variables. Where these 
relationships are not linear, we often use a suitable quadratic function. If the system is non-linear, then
the objective function may not be convex. That is, it may have more than one local minimum, and there
is no assurance that a local minimum is a global minimum.


Linear programming 


Why are linear models so ubiquitous? Firstly, they are relatively easy to understand and implement.
They are based on a well-founded mathematical theory that was developed around the mid 1700s and
that later played a pivotal role in the development of the digital computer. Computers are uniquely
suited to implementing linear programs because computers were conceptualized largely on the basis of
the theory of linear programming. Linear functions are always convex, meaning that any local optimum
over the feasible region is also a global optimum. Linear Programming (LP) problems are usually solved
using the simplex method. Suppose that we want to solve the following optimization problem:


$$\max\; x_1 + x_2 \quad \text{with constraints: } 2x_1 + x_2 \le 4 \text{ and } x_1 + 2x_2 \le 3$$


We assume that x1 and x2 are greater than or equal to 0. The first thing we need to do is convert it to
the standard form. This is done by ensuring that the problem is a maximization problem, that is, we
convert min z to max -z. We also need to convert the inequalities to equalities by adding non-negative
slack variables. The example here is already a maximization problem, so we can leave our objective
function as it is. We do need to change the inequalities in the constraints to equalities:


$$2x_1 + x_2 + x_3 = 4 \quad \text{and} \quad x_1 + 2x_2 + x_4 = 3$$

If we let z be the value of the objective function, we can then write the following:

$$z - x_1 - x_2 = 0$$


We now have the following system of linear equations: 
• Objective: z - x1 - x2 + 0 + 0 = 0
• Constraint 1: 2x1 + x2 + x3 + 0 = 4
• Constraint 2: x1 + 2x2 + 0 + x4 = 3


Our objective is to maximize z, remembering that all variables are non-negative. We can see that x1
and x2 appear in all the equations and are called non-basic variables. The x3 and x4 variables only
appear in one equation each. They are called basic variables. We can find a basic solution by assigning
all non-basic variables to 0. Here, this gives us the following:

$$x_1 = x_2 = 0, \quad x_3 = 4, \quad x_4 = 3, \quad z = 0$$
Is this an optimum solution, remembering that our goal is to maximize z? We can see that since x1 and
x2 have negative coefficients in the first equation in our linear system, increasing these variables
increases z.

If the coefficients in this equation were all non-negative, then there would be no way to increase z.
We will know that we have found an optimum solution when all coefficients in the objective equation
are non-negative.


This is not the case here. So, we take one of the non-basic variables with a negative coefficient in the
objective equation (say x1, which is called the entering variable) and use a technique called
pivoting to turn it from a non-basic to a basic variable. At the same time, we will change a basic
variable, called the leaving variable, into a non-basic one. We can see that x1 appears in both the
constraint equations, so which one do we choose to pivot on? Remember that we need to keep the
right-hand sides non-negative so that the solution stays feasible. We find that by using the pivot
element that yields the lowest ratio of the right-hand side of the equations to their respective entering
coefficients, we can find another basic solution. For x1, in this example, it gives us 4/2 for the first
constraint and 3/1 for the second. So, we will pivot using x1 in constraint 1.

We divide constraint 1 by 2, and get the following:

$$x_1 + \tfrac{1}{2}x_2 + \tfrac{1}{2}x_3 = 2$$
We can now write this in terms of x1, and substitute it into the other equations to eliminate x1 from
those equations. Once we have performed a bit of algebra, we end up with the following linear
system:

$$z - \tfrac{1}{2}x_2 + \tfrac{1}{2}x_3 = 2$$
$$x_1 + \tfrac{1}{2}x_2 + \tfrac{1}{2}x_3 = 2$$
$$\tfrac{3}{2}x_2 - \tfrac{1}{2}x_3 + x_4 = 1$$


We have another basic solution. But, is this the optimal solution? Since we still have a negative
coefficient in the first equation, the answer is no. We can now go through the same pivot process with
x2, and using the ratio rule, we find that we can pivot on (3/2)x2 in the third equation. This gives us the
following:

$$z + \tfrac{1}{3}x_3 + \tfrac{1}{3}x_4 = \tfrac{7}{3}$$
$$x_1 + \tfrac{2}{3}x_3 - \tfrac{1}{3}x_4 = \tfrac{5}{3}$$
$$x_2 - \tfrac{1}{3}x_3 + \tfrac{2}{3}x_4 = \tfrac{2}{3}$$

This gives us the solution x3 = x4 = 0, x1 = 5/3, x2 = 2/3, and z = 7/3. This is the optimal solution
because there are no more negative coefficients in the first equation.


We can visualize this with the following graph. The shaded area is the region where we will find a
feasible solution:
The two variable optimization problem
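The pivoting steps worked through by hand above can be checked with a small tableau routine. This is a sketch under stated assumptions rather than a production solver: `simplex_max` is a hypothetical helper, it assumes a bounded problem with non-negative right-hand sides, and it uses exact fractions so the results match the hand calculation.

```python
from fractions import Fraction as F

def simplex_max(tableau):
    """Pivot in place until no objective coefficient is negative.

    Row 0 is the objective equation z - c.x = 0 (so it stores -c); the other
    rows are constraints with slack variables. The last column is the
    right-hand side. Assumes a bounded problem with right-hand sides >= 0.
    """
    while True:
        objective = tableau[0][:-1]
        col = min(range(len(objective)), key=lambda j: objective[j])
        if objective[col] >= 0:
            return tableau  # no negative coefficient left: optimal
        # Ratio rule: smallest right-hand side / entering coefficient.
        ratios = [(row[-1] / row[col], i)
                  for i, row in enumerate(tableau) if i > 0 and row[col] > 0]
        _, pivot_row = min(ratios)
        pivot_value = tableau[pivot_row][col]
        tableau[pivot_row] = [v / pivot_value for v in tableau[pivot_row]]
        for i, row in enumerate(tableau):
            if i != pivot_row and row[col] != 0:
                factor = row[col]
                tableau[i] = [v - factor * p
                              for v, p in zip(row, tableau[pivot_row])]

# Columns: x1, x2, x3, x4, right-hand side.
tableau = [[F(v) for v in row] for row in [
    [-1, -1, 0, 0, 0],   # z - x1 - x2 = 0
    [2, 1, 1, 0, 4],     # 2x1 + x2 + x3 = 4
    [1, 2, 0, 1, 3],     # x1 + 2x2 + x4 = 3
]]
simplex_max(tableau)

# In this example, rows 1 and 2 end up holding the basic variables x1 and x2.
z, x1, x2 = tableau[0][-1], tableau[1][-1], tableau[2][-1]
print(z, x1, x2)  # 7/3 5/3 2/3
```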


Models 


Linear programming gives us a strategy for encoding real world problems into the language of 
computers. However, we must remember that our goal is not to just solve an instance of a problem, 
but to create a model that will solve unique problems from new data. This is the essence of learning. 
A learning model must have a mechanism to evaluate its output, and in turn, change its behavior to a 
state that is closer to a solution.


The model is essentially a hypothesis, that is, a proposed explanation of a phenomenon. The goal is for
it to apply a generalization to the problem. In the case of a supervised learning problem, knowledge
gained from the training set is applied to the unlabeled test set. In the case of an unsupervised learning
problem, such as clustering, the system does not learn from a training set. It must learn from the
characteristics of the data set itself, such as the degree of similarity. In both cases, the process is
iterative. It repeats a well-defined set of tasks, which moves the model closer to a correct hypothesis. 


Models are the core of a machine learning system. They are what does the learning. There are many
models, with as many variations on these models, as there are unique solutions. We can see that the
problems machine learning systems solve (regression, classification, association, and so on) come up
in many different settings. They have been used successfully in almost all branches of science,
engineering, mathematics, commerce, and also in the social sciences; they are as diverse as the
domains they operate in.


This diversity of models gives machine learning systems great problem solving power. However, it
can also be a bit daunting for the designer to decide which model, or models, is best for a
particular problem. To complicate things, there are often several models that may solve your task, or
your task may need several models. The most accurate and efficient pathway through an
original problem is something you simply cannot know when you embark upon such a project.


For our purposes here, let's break this broad canvas into three overlapping, non-mutually-exclusive
categories: geometric, probabilistic, and logical. Within these three categories, a distinction must be
made regarding how a model divides up the instance space. The instance space can be considered as
all the possible instances of your data, regardless of whether each instance appears in the data. The
actual data is a subset of the instance space.


There are two approaches to dividing up this space: grouping and grading. The key difference
between the two is that grouping models divide the instance space into fixed discrete units called
segments. They have a finite resolution and cannot distinguish between instances beyond this
resolution. Grading models, on the other hand, form a global model over the entire instance space, rather
than dividing the space into segments. In theory, their resolution is infinite, and they can distinguish
between instances no matter how similar they are. The distinction between grouping and grading is
not absolute, and many models contain elements of both. For instance, a linear classifier is generally
considered a grading model because it is based on a continuous function. However, there are
instances that the linear model cannot distinguish between, for example, instances that lie on a line or
surface parallel to the decision boundary.


Geometric models 


Geometric models use the concept of instance space. The most obvious example of geometric models 
is when all the features are numerical and can become coordinates in a Cartesian coordinate system. 
When we only have two or three features, they are easy to visualize. However, since many machine 
learning problems have hundreds or thousands of features, and therefore dimensions, visualizing these
spaces is impossible. However, many of the geometric concepts, such as linear transformations, still
apply in this hyperspace. This can help us better understand our models. For instance, we expect that
many learning algorithms will be translation invariant, that is, it does not matter where we place the
origin in the coordinate system. Also, we can use the geometric concept of Euclidean distance to
measure any similarities between instances; this gives us a method to cluster like instances and form a
decision boundary between them.


Suppose we are using our linear classifier to classify paragraphs as either happy or sad and we
have devised a set of tests. Each test is associated with a weight, w, to determine how much each test
contributes to the overall result.

We can simply multiply each test result by its weight and sum the products to get an overall score, and
then create a decision rule that will create a boundary, for example, if the happy score is greater than a
threshold, t:

$$\sum_{i=1}^{n} w_i x_i > t$$

Each feature contributes independently to the overall result, hence the rule's linearity. This
contribution depends on each feature's relative weight. This weight can be positive or negative, and
each individual feature is not subject to the threshold while calculating the overall score.


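We can rewrite this sum with vector notation using w for a vector of weights (w1, w2, ..., wn) and x
for a vector of test results (x1, x2, ..., xn). Also, if we make it an equality, we can define the decision
boundary:

$$\mathbf{w} \cdot \mathbf{x} = t$$

We can think of w as a vector pointing between the "centers of mass" of the positive (happy)
examples, P, and the negative examples, N. We can calculate these centers of mass by averaging the
examples in each class:

$$\mathbf{P} = \frac{1}{n_{+}} \sum_{\mathbf{x} \in \mathrm{Pos}} \mathbf{x} \quad \text{and} \quad \mathbf{N} = \frac{1}{n_{-}} \sum_{\mathbf{x} \in \mathrm{Neg}} \mathbf{x}$$

Here, Pos and Neg are the sets of positive and negative examples, and n+ and n- are their sizes.

Our aim now is to create a decision boundary halfway between these centers of mass. We can see
that w is proportional, or equal, to P - N, and that (P + N)/2 will be on the decision boundary. So, we
can write the following:

$$t = (\mathbf{P} - \mathbf{N}) \cdot \frac{(\mathbf{P} + \mathbf{N})}{2} = \frac{\lVert \mathbf{P} \rVert^{2} - \lVert \mathbf{N} \rVert^{2}}{2}$$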
Figure: Decision boundary


In practice, real data is noisy and not necessarily easy to separate. Even when data is easily
separable, a particular decision boundary may not have much meaning. Consider data that is sparse,
such as in text classification where the number of words is large compared to the number of instances
of each word. In this large area of empty instance space, it may be easy to find a decision boundary,
but which is the best one? One way to choose is to use a margin to measure the distance between the
decision boundary and its closest instance. We will explore these techniques later in the book.
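The center-of-mass construction above translates directly into code. The following is a minimal sketch; the function names and the four two-dimensional "test result" vectors are invented for illustration and are not taken from the book.

```python
def mean_vector(points):
    """Center of mass: the coordinate-wise average of a list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def fit_basic_linear_classifier(positives, negatives):
    """Return (w, t) with w = P - N and t = (|P|^2 - |N|^2) / 2."""
    P = mean_vector(positives)
    N = mean_vector(negatives)
    w = [p - n for p, n in zip(P, N)]
    t = (dot(P, P) - dot(N, N)) / 2
    return w, t

def is_positive(w, t, x):
    """Decision rule: the instance is positive when w . x exceeds t."""
    return dot(w, x) > t

# Hypothetical data: two happy and two sad paragraphs, two tests each.
happy = [(3.0, 4.0), (5.0, 4.0)]  # center of mass P = (4, 4)
sad = [(0.0, 1.0), (2.0, 1.0)]    # center of mass N = (1, 1)
w, t = fit_basic_linear_classifier(happy, sad)

print(w, t)                         # [3.0, 3.0] 15.0
print(is_positive(w, t, (4.0, 4.0)))  # True
print(is_positive(w, t, (1.0, 1.0)))  # False
```

The midpoint (P + N)/2 = (2.5, 2.5) satisfies w · x = t exactly, which is what places the boundary halfway between the two centers of mass.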


Probabilistic models 


A typical example of a probabilistic model is the Bayesian classifier, where you are given some
training data (D), and a probability based on an initial training set (a particular hypothesis, h), and
you obtain the posterior probability, P(h|D):

$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$
As an example, consider that we have a bag of marbles. We know that 40 percent of them are red and
60 percent are blue. We also know that half of the red marbles and all the blue marbles have flecks of 
white. When we reach into the bag to select a marble, we can feel by its texture that it has flecks. 
What are the chances of it being red? 


Let P(R|F) be equal to the probability that a randomly drawn marble with flecks is red:

P(F|R) = the probability that a red marble has flecks, which is 0.5.

P(R) = the probability of a marble being red, which is 0.4.

P(F) = the probability that a marble has flecks, which is 0.5 x 0.4 + 1 x 0.6 = 0.8.

$$P(R \mid F) = \frac{P(F \mid R)\,P(R)}{P(F)} = \frac{0.5 \times 0.4}{0.8} = 0.25$$
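The marble calculation is short enough to reproduce directly. This is a small sketch in which the variable names are illustrative; the probabilities are the ones given in the example.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' rule: P(h|D) = P(D|h) * P(h) / P(D)."""
    return likelihood * prior / evidence

p_red, p_blue = 0.4, 0.6
p_flecks_given_red, p_flecks_given_blue = 0.5, 1.0

# Total probability of drawing a flecked marble, whatever its color.
p_flecks = p_flecks_given_red * p_red + p_flecks_given_blue * p_blue

p_red_given_flecks = posterior(p_flecks_given_red, p_red, p_flecks)
print(round(p_flecks, 2), round(p_red_given_flecks, 2))
```

Feeling the flecks lowers the chance of red from the prior of 0.4 to 0.25, because every blue marble has flecks but only half of the red ones do.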


Probabilistic models allow us to explicitly calculate probabilities, rather than just a binary true or
false. As we know, the key thing we need to do is create a model that maps a feature variable to a
target variable. When we take a probabilistic approach, we assume that there is an underlying random
process that creates a well-defined but unknown probability distribution.


Consider a spam detector. Our feature variable, X, could consist of a set of words that indicate the
email might be spam. The target variable, Y, is the instance class, either spam or ham. We are
interested in the conditional probability of Y given X. For each email instance, there will be a feature
vector, X, consisting of Boolean values representing the presence of our spam words. We are trying to
find out whether Y, our target Boolean, is representing spam or not spam.


Now, consider that we have two words, x1 and x2, that constitute our feature vector X. From our
training set, we can construct a table such as the following one:


x1   x2   P(Y = spam | x1, x2)   P(Y = not spam | x1, x2)
0    0    0.1                    0.9
0    1    0.7                    0.3
1    0    0.4                    0.6
1    1    0.8                    0.2

Table 1.1


We can see that once we begin adding more words to our feature vector, it will quickly grow
unmanageable. With a feature vector of size n, we will have 2^n cases to distinguish. Fortunately, there
are other methods to deal with this problem, as we shall see later.


The probabilities in the preceding table are known as posterior probabilities. These are used when
we have knowledge from a prior distribution, for instance, that one in ten emails is spam. However,
consider a case where we may know that X contains x2 = 1, but we are unsure of the value of x1. This
instance could belong in row 2, where the probability of it being spam is 0.7, or in row 4, where the
probability is 0.8. The solution is to average these two rows using the probability of x1 = 1 in any
instance, that is, the probability that a word, x1, will appear in any email, spam or not:

$$P(Y \mid x_2 = 1) = P(Y \mid x_1 = 0, x_2 = 1)\,P(x_1 = 0) + P(Y \mid x_1 = 1, x_2 = 1)\,P(x_1 = 1)$$


This weighted average over the unknown value of x1 is called marginalization. If we know, from a
training set, that the probability that x1 is one is 0.1, then the probability that it is zero is 0.9, since
these probabilities must sum to 1. So, we can calculate the probability that an e-mail containing x2 is
spam: 0.7 * 0.9 + 0.8 * 0.1 = 0.71.
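The weighted average above can be written as a short function. In this sketch, the 0.7 and 0.8 entries and the 0.1 probability for x1 are the values quoted in the text; the other two table entries (0.1 and 0.4) are read off the posterior odds given later and should be treated as assumptions, as should the names `p_spam`, `p_x1`, and `p_spam_given_x2`.

```python
# P(Y = spam | x1, x2), keyed by (x1, x2).
p_spam = {(0, 0): 0.1, (0, 1): 0.7, (1, 0): 0.4, (1, 1): 0.8}

# Probability of the word x1 being absent (0) or present (1) in any email.
p_x1 = {0: 0.9, 1: 0.1}

def p_spam_given_x2(x2):
    """Average over the unknown x1, weighting each row by P(x1)."""
    return sum(p_spam[(x1, x2)] * p_x1[x1] for x1 in (0, 1))

print(round(p_spam_given_x2(1), 2))
```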


Consider the likelihood function, P(X|Y). Why do we want to know the probability of X,
which is something we already know, conditioned on Y, which is something we know nothing about?
A way to look at this is to consider the probability of any email containing a particular random
paragraph, say, the 127th paragraph of War and Peace. Clearly, this probability is small, regardless of
whether the e-mail is spam or not. What we are really interested in is not the magnitude of these
likelihoods, but rather their ratio. How much more likely is an email containing a particular
combination of words to be spam or not spam? Models built on likelihoods are sometimes called
generative models because we can sample across all the variables involved.


We can use Bayes' rule to transform between prior distributions and a likelihood function:

$$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}$$

P(Y) is the prior probability, that is, how likely each class is, before having observed X. Similarly,
P(X) is the probability without taking into account Y. If we have only two classes, we can work with
ratios. For instance, if we want to know how much the data favors each class, we can use the
following:
$$\frac{P(Y = \text{spam} \mid X)}{P(Y = \text{ham} \mid X)} = \frac{P(X \mid Y = \text{spam})}{P(X \mid Y = \text{ham})} \cdot \frac{P(Y = \text{spam})}{P(Y = \text{ham})}$$

If the odds are less than one, we assume that the class in the denominator is the most likely. If it is
greater than one, then the class in the numerator is the most likely. If we use the data from Table 1.1,
we calculate the following posterior odds:

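$$\frac{P(Y = \text{spam} \mid x_1 = 0, x_2 = 0)}{P(Y = \text{ham} \mid x_1 = 0, x_2 = 0)} = \frac{0.1}{0.9} \approx 0.11$$

$$\frac{P(Y = \text{spam} \mid x_1 = 1, x_2 = 1)}{P(Y = \text{ham} \mid x_1 = 1, x_2 = 1)} = \frac{0.8}{0.2} = 4$$

$$\frac{P(Y = \text{spam} \mid x_1 = 0, x_2 = 1)}{P(Y = \text{ham} \mid x_1 = 0, x_2 = 1)} = \frac{0.7}{0.3} \approx 2.3$$

$$\frac{P(Y = \text{spam} \mid x_1 = 1, x_2 = 0)}{P(Y = \text{ham} \mid x_1 = 1, x_2 = 0)} = \frac{0.4}{0.6} \approx 0.67$$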
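The posterior odds can be computed for every row at once. This sketch assumes the spam probabilities 0.1, 0.7, 0.4, and 0.8 for the four feature combinations, as used in the odds above; the dictionary layout and function names are illustrative.

```python
# P(Y = spam | x1, x2), keyed by (x1, x2); P(Y = ham | ...) is the complement.
p_spam = {(0, 0): 0.1, (0, 1): 0.7, (1, 0): 0.4, (1, 1): 0.8}

def posterior_odds(p):
    """Odds of spam against ham for a given spam probability."""
    return p / (1 - p)

def most_likely_class(p):
    """Odds above one favor the numerator (spam); otherwise ham."""
    return "spam" if posterior_odds(p) > 1 else "ham"

for features, p in sorted(p_spam.items()):
    print(features, round(posterior_odds(p), 2), most_likely_class(p))
```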


The likelihood function is important in machine learning because it creates a generative model. If we
know the probability distribution of each word in a vocabulary, together with the likelihood of each 
one appearing in either a spam or not spam e-mail, we can generate a random spam e-mail according 
to the conditional probability, P(X|Y = spam). 


Logical models 


Logical models are based on algorithms. They can be translated into a set of formal rules that can be
understood by humans. For example, if both x1 and x2 are 1 then the email is classified as spam.


These logical rules can be organized into a tree structure. In the following figure, we see that the 
instance space is iteratively partitioned at each branch. The leaves consist of rectangular areas (or
hyper rectangles in the case of higher dimensions) representing segments of the instance space. 
Depending on the task we are solving, the leaves are labeled with a class, probability, real number, 
and so on. 


[Figure: feature tree, with a leaf labeled p(spam)=0.4]

The feature tree


Feature trees are very useful when representing machine learning problems, even those that, at first
sight, do not appear to have a tree structure. For instance, in the Bayes classifier in the previous
section, we can partition our instance space into as many regions as there are combinations of feature
values. Decision tree models often employ a pruning technique to delete branches that give an
incorrect result. In Chapter 3, Turning Data into Information, we will look at a number of ways to
represent decision trees in Python.


Note 
Note that decision rules may overlap and make contradictory predictions. They are then said to be
logically inconsistent. Rules can also be incomplete when they do not take into account all the
coordinates in the feature space. There are a number of ways that we can address these issues, and we
will look at these in detail later in the book.


Since tree learning algorithms usually work in a top-down manner, the first task is to find a good
feature to split on at the top of the tree. We need to find a split that will result in a higher degree of
purity in subsequent nodes. By purity, I mean the degree to which training examples all belong to the
same class. As we descend the tree, at each level, we find the training examples at each node
increase in purity, that is, they increasingly become separated into their own classes until we reach
the leaf where all examples belong to the same class.


To look at this in another way, we are interested in lowering the entropy of subsequent nodes in our
decision tree. Entropy, a measure of disorder, is high at the top of the tree (the root) and is 
progressively lowered at each node as the data is divided up into its respective classes. 
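To make the idea of lowering entropy concrete, the sketch below measures the entropy of a node and the size-weighted entropy of its children after a split. It assumes Shannon entropy in bits, which is a common choice but is not specified in the text; the labels and the split are invented for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of the class labels at a node."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())

def weighted_child_entropy(children):
    """Average entropy of the child nodes, weighted by their size."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * entropy(child) for child in children)

# Hypothetical root node: a 50/50 mix, the most disordered case for two classes.
root = ["spam"] * 5 + ["ham"] * 5

# Hypothetical split on some feature: each child is mostly one class.
children = [["spam"] * 4 + ["ham"], ["spam"] + ["ham"] * 4]

print(round(entropy(root), 3))
print(round(weighted_child_entropy(children), 3))
```

A good split is one where the second number is clearly smaller than the first; a leaf whose examples all share one class has an entropy of zero.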


In more complex problems, those with larger feature sets and decision rules, finding the optimum
splits is sometimes not possible, at least not in an acceptable amount of time. We are really interested
in creating the shallowest tree to reach our leaves in the shortest path. The time it takes to analyze
each node grows exponentially with each additional feature, so the optimum decision tree may take
longer to find than actually using a sub-optimum tree to perform the task.


An important property of logical models is that they can, to some extent, provide an explanation for 
their predictions. For example, consider the predictions made by a decision tree. By tracing the path 
from leaf to root we can determine the conditions that resulted in the final result. This is one of the
advantages of logical models: they can be inspected by a human to reveal more about the problem. 


Features 


In the same way that decisions are only as good as the information available to us in real life, in a
machine learning task, the model is only as good as its features. Mathematically, features are a 
function that maps from the instance space to a set of values in a particular domain. In machine 
learning, most measurements we make are numerical, and therefore the most common feature domain 
is the set of real numbers. Other common domains include Boolean, true or false, integers (say, when 
we are counting the occurrence of a particular feature), or finite sets such as a set of colors or shapes. 


Models are defined in terms of their features. Also, single features can be turned into a model, which 
is known as a univariate model. We can distinguish between two uses of features. This is related to 
the distinction between grouping and grading. 


Firstly, we can group our features by zooming into an area in the instance space. Let f be a feature
counting the number of occurrences of a word, x1, in an e-mail, X. We can set up conditions such as
the following: f(X) = 0, representing emails that do not contain x1, or f(X) > 0, representing emails that
contain x1 one or more times. These conditions are called binary splits because they divide the
instance space into two groups: those that satisfy the condition and those that don't. We can also split
the instance space into more than two segments to create non-binary splits. For instance, f(X) = 0;
0 < f(X) ≤ 5; f(X) > 5, and so on.
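A counting feature and the splits built on it might look like the following sketch. The function names, the whitespace tokenization, the segment labels, and the boundary at five occurrences are assumptions chosen for illustration.

```python
def count_feature(word, email_text):
    """f(X): how many times the word occurs in the email."""
    return email_text.lower().split().count(word)

def binary_split(count):
    """Two segments: f(X) = 0 and f(X) > 0."""
    return "absent" if count == 0 else "present"

def three_way_split(count):
    """Three segments: f(X) = 0, 0 < f(X) <= 5, and f(X) > 5."""
    if count == 0:
        return "none"
    return "few" if count <= 5 else "many"

emails = [
    "lunch at noon",
    "win money now",
    "money money money money money money",
]
for text in emails:
    f = count_feature("money", text)
    print(f, binary_split(f), three_way_split(f))
```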


Secondly, we can grade our features to calculate the independent contribution each one makes to the
overall result. Recall our simple linear classifier, which has a decision rule of the following form:

$$\sum_{i=1}^{n} w_i x_i > t$$


Since this rule is linear, each feature makes an independent contribution to the score of an instance.
This contribution depends on wi. If it is positive, then a positive xi will increase the score. If wi is
negative, a positive xi decreases the score. If wi is small or zero, then the contribution it makes to the
overall result is negligible. It can be seen that the features make a measurable contribution to the final
prediction.


These two uses of features, as splits (grouping) and predictors (grading), can be combined into one
model. A typical example occurs when we want to approximate a non-linear function, say y = sin πx,
on the interval -1 < x < 1. Clearly, the simple linear model will not work. Of course, the simple
answer is to split the x axis into -1 < x < 0 and 0 ≤ x < 1. On each of these segments, we can find a
reasonable linear approximation.
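The combination of a split with a linear fit on each segment can be tried numerically. The sketch below fits ordinary least squares lines to sampled values of sin πx, once over the whole interval and once on each half. The sampling grid and helper names are assumptions. How much the error improves depends on the function and on where the split is placed, so the printed numbers are only indicative.

```python
from math import pi, sin

def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def squared_error(points, line):
    """Sum of squared residuals of the points around the line."""
    slope, intercept = line
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points)

# Sample the function on an evenly spaced grid over the interval.
points = [(i / 100, sin(pi * i / 100)) for i in range(-100, 101)]

# Grading only: one line over the whole interval.
single_error = squared_error(points, fit_line(points))

# Grouping then grading: split at x = 0 and fit a line on each segment.
segments = [[p for p in points if p[0] < 0], [p for p in points if p[0] >= 0]]
piecewise_error = sum(squared_error(seg, fit_line(seg)) for seg in segments)

print(round(single_error, 1), round(piecewise_error, 1))
```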





Using grouping and grading 


A lot of work can be done to improve our model's performance by feature construction and
transformation. In most machine learning problems, the features are not necessarily explicitly
available. They need to be constructed from raw datasets and then transformed into something that our
model can make use of. This is especially important in problems such as text classification. In our
simple spam example, we used what is known as a bag of words representation because it disregards
the order of the words. However, by doing this, we lose important information about the meaning of
the text.


An important part of feature construction is discretization. We can sometimes extract more
information, or information that is more relevant to our task, by dividing features into relevant chunks.
For instance, suppose our data consists of a list of people's precise incomes, and we are trying to
determine whether there is a relationship between financial income and the suburb a person lives in.
Clearly, it would be appropriate if our feature set did not consist of precise incomes but rather ranges
of income, although strictly speaking, we would lose information. If we choose our intervals
appropriately, we will not lose information related to our problem, and our model will perform better
and give us results that are easier to interpret.
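Discretizing incomes into ranges can be done with a sorted list of interval edges. In this sketch the edges, the range labels, and the sample incomes are all hypothetical; in practice the intervals would be chosen to suit the problem.

```python
from bisect import bisect_right

def discretize(value, edges, labels):
    """Map a number to the label of the interval it falls into.

    edges must be sorted, and labels must have one more entry than edges.
    A value equal to an edge is placed in the interval above it.
    """
    return labels[bisect_right(edges, value)]

# Hypothetical income bands.
edges = [30000, 60000, 100000]
labels = ["low", "lower-middle", "upper-middle", "high"]

incomes = [25000, 30000, 75000, 250000]
bands = [discretize(income, edges, labels) for income in incomes]
print(bands)  # ['low', 'lower-middle', 'upper-middle', 'high']
```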


This highlights the major task of feature selection: separating the signal from the noise.


Real world data will invariably contain a lot of information that we do not need, as well as just plain
random noise, and separating the, perhaps small, part of the data that is relevant to our needs is
important to the success of our model. It is of course important that we do not throw out information
that may be important to us.


Often, our features will be non-linear, and linear regression may not give us good results. A trick is to
transform the instance space itself. Suppose we have data such as what is shown in the following
figure. Clearly, linear regression only gives us a reasonable fit, as shown in the figure on the left-hand
side. However, we can improve this result if we square the instance space, that is, we replace x with
x^2 and y with y^2, as shown in the figure on the right-hand side:





[Figure panels: left, Variance = .92; right, Variance = .97]

Transforming the instance space
We can go further and use a technique called the kernel trick. The idea is that we can create a higher
dimensional implicit feature space. Pairs of data points are mapped from the raw dataset to this 
higher dimensional space via a specified function, sometimes called a similarity function. 


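For instance, let $\mathbf{x}_1 = (x_1, y_1)$ and $\mathbf{x}_2 = (x_2, y_2)$.

We create a 2D to 3D mapping, shown as follows:

$$(x, y) \mapsto (x^2,\; y^2,\; \sqrt{2}\,xy)$$

The points in the 3D space corresponding to the 2D points, $\mathbf{x}_1$ and $\mathbf{x}_2$, are as follows:

$$\mathbf{x}_1' = (x_1^2,\; y_1^2,\; \sqrt{2}\,x_1 y_1) \quad \text{and} \quad \mathbf{x}_2' = (x_2^2,\; y_2^2,\; \sqrt{2}\,x_2 y_2)$$

Now, the dot product of these two vectors is:

$$\mathbf{x}_1' \cdot \mathbf{x}_2' = x_1^2 x_2^2 + y_1^2 y_2^2 + 2 x_1 y_1 x_2 y_2 = (x_1 x_2 + y_1 y_2)^2 = (\mathbf{x}_1 \cdot \mathbf{x}_2)^2$$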


We can see that by squaring the dot product in the original 2D space, we obtain the dot product in the
3D space without actually creating the feature vectors themselves. Here, we have defined the kernel
k(x1, x2) = (x1 · x2)^2. Calculating the kernel in the original space is often computationally cheaper
than calculating the dot product in the higher dimensional space, and as we will see, this technique is
used quite widely in machine learning, from Support Vector Machines (SVM) to Principal Component
Analysis (PCA) and correlation analysis.
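The identity behind this kernel is easy to check numerically. The sketch below compares the squared 2D dot product with the dot product of the explicitly mapped 3D vectors; the two sample points and the function names are arbitrary choices for illustration.

```python
from math import isclose, sqrt

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(point):
    """Explicit 2D to 3D mapping: (x, y) -> (x^2, y^2, sqrt(2) * x * y)."""
    x, y = point
    return (x * x, y * y, sqrt(2) * x * y)

def kernel(a, b):
    """k(a, b) = (a . b)^2, computed entirely in the original 2D space."""
    return dot(a, b) ** 2

a, b = (1.0, 2.0), (3.0, -1.0)

explicit = dot(phi(a), phi(b))  # build the 3D vectors, then take the dot product
implicit = kernel(a, b)         # never leaves the 2D space

print(isclose(explicit, implicit))  # True
```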


The basic linear classifier we looked at earlier defines a decision boundary, w · x = t. The vector, w,
is equal to the difference between the mean of the positive examples and the mean of the negative
examples, p - n. Suppose that we have the points n = (0,0) and p = (0,1). Let's assume that we have
obtained the positive mean from two training examples, p1 = (-1,1) and p2 = (1,1). Therefore, we have
the following:

$$\mathbf{p} = \tfrac{1}{2}(\mathbf{p}_1 + \mathbf{p}_2)$$

We can now write the decision boundary as the following:

$$\tfrac{1}{2}\,\mathbf{p}_1 \cdot \mathbf{x} + \tfrac{1}{2}\,\mathbf{p}_2 \cdot \mathbf{x} - \mathbf{n} \cdot \mathbf{x} = t$$

Using the kernel trick, we can obtain the following decision boundary:

$$\tfrac{1}{2}\,k(\mathbf{p}_1, \mathbf{x}) + \tfrac{1}{2}\,k(\mathbf{p}_2, \mathbf{x}) - k(\mathbf{n}, \mathbf{x}) = t$$

With the kernel we defined earlier, we get the following:

$$k(\mathbf{p}_1, \mathbf{x}) = (-x + y)^2, \quad k(\mathbf{p}_2, \mathbf{x}) = (x + y)^2, \quad \text{and} \quad k(\mathbf{n}, \mathbf{x}) = 0$$

We can now derive the decision boundary:

$$\tfrac{1}{2}(-x + y)^2 + \tfrac{1}{2}(x + y)^2 = x^2 + y^2 = t$$

This is simply a circle around the origin with a radius √t.


With the basic linear classifier, a new instance only needs to be compared with the class means.
Using the kernel trick, on the other hand, each new instance is evaluated against each training
example. In return for this more complex calculation we obtain a more flexible non-linear decision
boundary.
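The kernelized decision function for this example can be evaluated directly against the three points. This sketch uses p1 = (-1, 1), p2 = (1, 1), and n = (0, 0) from the text; the function names are illustrative.

```python
from math import isclose

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def kernel(a, b):
    """The quadratic kernel k(a, b) = (a . b)^2."""
    return dot(a, b) ** 2

P1, P2, N = (-1.0, 1.0), (1.0, 1.0), (0.0, 0.0)

def kernel_score(x):
    """Left-hand side of the kernelized boundary:
    k(p1, x)/2 + k(p2, x)/2 - k(n, x)."""
    return 0.5 * kernel(P1, x) + 0.5 * kernel(P2, x) - kernel(N, x)

# The score equals x^2 + y^2, so the boundary score = t is a circle of radius sqrt(t).
for x in [(3.0, 4.0), (0.5, -2.0), (0.0, 0.0)]:
    print(x, kernel_score(x), x[0] ** 2 + x[1] ** 2)
```

Note that each new instance is compared with both positive training examples, not just their mean.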


A very interesting and important aspect is the interaction between features. One form of interaction is
correlation. For example, among the words in a blog post, we might expect there to be a positive
correlation between the words winter and cold, and a negative correlation between winter and hot.
What this means for your model depends on your task. If you are doing a sentiment analysis, you might
want to consider reducing the weights of each word if they appear together since the addition of
another correlated word would be expected to contribute marginally less weight to the overall result
than if that word appeared by itself.


Also with regard to sentiment analysis, we often need to transform certain features to capture their
meaning. For example, the phrase not happy contains a word that would, if we just used 1-grams,
contribute to a positive sentiment score even though its sentiment is clearly negative. A solution (apart
from using 2-grams, which may unnecessarily complicate the model) would be to recognize when
these two words appear in a sequence and create a new feature, not happy, with an associated
sentiment score.
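One way to build such a feature is to merge a negation word with the token that follows it before scoring. The sketch below is only an illustration: the `merge_negations` helper, the underscore naming of the merged token, and the tiny sentiment lexicon are all assumptions.

```python
def merge_negations(tokens, negators=("not",)):
    """Join each negator with the following token, e.g. not + happy -> not_happy."""
    merged, i = [], 0
    while i < len(tokens):
        if tokens[i] in negators and i + 1 < len(tokens):
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical sentiment scores for individual features.
SCORES = {"happy": 1.0, "sad": -1.0, "not_happy": -1.0}

def sentiment(text):
    """Sum the scores of the features found in the text."""
    tokens = merge_negations(text.lower().split())
    return sum(SCORES.get(token, 0.0) for token in tokens)

print(sentiment("I am happy"))      # 1.0
print(sentiment("I am not happy"))  # -1.0
```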


Selecting and optimizing features is time well spent. It can be a significant part of the design of
learning systems. This iterative design process flips between two phases: firstly, understanding the
properties of the phenomena you are studying, and secondly, testing your ideas with experimentation.
This experimentation gives us deeper insight into the phenomena, allowing us to optimize our features
and gain deeper understanding, among other things, until we are satisfied that our model gives us
an accurate reflection of reality.
Unified modeling language


Machine learning systems can be complex. It is often difficult for a human brain to understand all the 
interactions of a complete system. We need some way to abstract the system into a set of discrete 
functional components. This enables us to visualize our system's structure and behavior with diagrams 
and plots. 


UML is a formalism that allows us to visualize and communicate our design ideas in a precise way.
We implement our systems in code, and the underlying principles are expressed in mathematics, but
there is a third aspect, which is, in a sense, perpendicular to these, and that is a visual representation
of our system. The process of drawing out your design helps conceptualize it from a different
perspective. Perhaps we could think of this as trying to triangulate a solution.


Conceptual models are theoretical devices for describing elements of a problem. They can help us 
clarify assumptions, prove certain properties, and give us a fundamental understanding of the 
structures and interactions of systems. 


UML arose out of the need to both simplify this complexity and allow our designs to be communicated
clearly and unambiguously to team members, clients, and other stakeholders. A model is a simplified
representation of a real system. Here, we use the word model in a more general sense, as compared
to its more precise machine learning definition. UML can be used to model almost any system
imaginable. The core idea is to strip away any irrelevant and potentially confusing elements, leaving a
clear representation of core attributes and functions.


Class diagrams 


The class diagram models the static structure of a system. Classes represent abstract entities with 
common characteristics. They are useful because they express, and enforce, an object-oriented 
approach to our programming. By separating distinct objects in our code, we can work more clearly 
on each object as a self-contained unit. We can define it with a specific set of characteristics, and 
define how it relates to other objects. This enables complex programs to be broken down into 
separate functional components. It also allows us to subclass objects via inheritance. This is 
extremely useful and mirrors how we model the hierarchical aspects of our world (for example, 
programmer is a subclass of human, and Python programmer is a subclass of programmer). 
Object-oriented programming can speed up the overall development time because it allows the 
reuse of components. There is a rich class library of developed components to draw upon. Also, the 
code produced tends to be easier to maintain because we can replace or change classes and are able 
to (usually) understand how this will affect the overall system. 
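

The human, programmer, and Python programmer hierarchy mentioned above can be sketched in a few lines of Python. This is only an illustration: the class names, the describe() method, and the languages attribute are hypothetical and chosen to mirror the example in the text.

```python
class Human:
    """Base class: attributes shared by every human."""

    def __init__(self, name):
        self.name = name

    def describe(self):
        return f"{self.name} is a human"


class Programmer(Human):
    """A programmer is a human who also knows some languages."""

    def __init__(self, name, languages):
        super().__init__(name)
        self.languages = list(languages)

    def describe(self):
        # Reuse the parent's behavior and extend it.
        return f"{super().describe()} who programs in {', '.join(self.languages)}"


class PythonProgrammer(Programmer):
    """A Python programmer is a programmer whose languages include Python."""

    def __init__(self, name):
        super().__init__(name, ["Python"])


ada = PythonProgrammer("Ada")
print(ada.describe())
print(isinstance(ada, Human))
```

Each subclass reuses what its parent already defines and only adds or overrides what is specific to it, which is the reuse benefit described above.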


In truth, object coding does tend to result in a larger code base, and this can mean that programs will 
be slower to run. In the end, it is not an "either, or" situation. For many simple tasks, you probably do 
not want to spend the time creating a class if you may never use it again. In general, if you find 
yourself typing the same bits of code, or creating the same type of data structures, it is probably a 
good idea to create a class. The big advantage of object-oriented programming is that we can 
encapsulate the data and the functions that operate on the data in one object. These software objects 
can correspond in quite a direct way with real world objects. 


Designing object-oriented systems may take some time, initially. However, while establishing a 
workable class structure and class definitions, the coding tasks required to implement the classes 
become clearer. Creating a class structure can be a very useful way to begin modeling a system. 
When we define a class, we are interested in a specific set of attributes: a subset of all possible 
attributes that leaves out the irrelevant ones. It should be an accurate representation of a real system, 
and we need to make the judgment as to what is relevant and what is not. This is difficult because real 
world phenomena are complex, and the information we have about the system is always incomplete. 
We can only go by what we know, so our domain knowledge (the understanding of the system or 
systems we are trying to model, whether software, natural, or human) is critically important. 


Object diagrams 


Object diagrams are a logical view of the system at runtime. They are a snapshot at a particular 
instant in time and can be understood as an instance of a class diagram. Many parameters and 
variables change value as the program is run, and the object diagram's function is to map these. This 
runtime binding is one of the key things object diagrams represent. By using links to tie objects 
together, we can model a particular runtime configuration. Links between objects correspond to 
associations between the objects' classes, so a link is bound by the same constraints that the class 
association enforces on its objects. 


[Figure: two linked objects. Training Data holds X = iris.data and Y = iris.target; Cross Validation 
holds XY and TestSize = 0.4.] 


The object diagram 


Both the class diagram and the object diagram are made of the same basic elements. While the class 
diagram represents an abstract blueprint of the class, the object diagram represents the real state of 
an object at a particular point in time. A single object diagram cannot represent every class instance, 
so when drawing these diagrams, we must confine ourselves to the important instances and instances 
that cover the basic functionality of the system. The object diagram should clarify the association 
between objects and indicate the values of important variables. 
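

One way to see what an object diagram records is to build two linked objects and print their state at a single instant. The sketch below is an assumption-laden illustration: the TrainingData and CrossValidation classes, their field names, and the tiny data values are hypothetical stand-ins for the objects in the figure.

```python
from dataclasses import dataclass, asdict


@dataclass
class TrainingData:
    X: list   # feature rows
    y: list   # target labels


@dataclass
class CrossValidation:
    data: TrainingData   # link to a TrainingData object
    test_size: float


training = TrainingData(X=[[5.1, 3.5], [4.9, 3.0]], y=[0, 0])
cv = CrossValidation(data=training, test_size=0.4)

# The snapshot lists the current value of every attribute, including the
# linked object, which is what an object diagram draws.
snapshot = asdict(cv)
print(snapshot)
```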
Activity diagrams 


The purpose of an activity diagram is to model the system's workflow by chaining together separate 
actions that together represent a process. They are particularly good at modeling sets of coordinated 
tasks. Activity diagrams are among the most used diagrams in the UML specification because they 
are intuitive to understand: their format is based on traditional flow chart diagrams. The main 
components of an activity diagram are actions, edges (sometimes called paths), and decisions. 
Actions are represented by rounded rectangles, edges are represented by arrows, and decisions are 
represented by a diamond. Activity diagrams usually have a start node and an end node. 


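[Figure: an example activity diagram. From the start node, the actions "Convert linear equations to 
standard form", "Assign non-basic variables to 0", "Choose entering and leaving variables", and 
"Pivot" lead to the decision "Minus coefficient in objective?", which has a "No" branch.] 


A figure of an example activity diagram 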


State diagrams 


State diagrams are used to model systems that change behavior depending on what state they are in. 
They are represented by states and transitions. States are represented by rounded rectangles and 
transitions by arrows. Each transition has a trigger, and this is written along the arrow. 


Many state diagrams will include an initial pseudo state and a final state. Pseudo states are states that 
control the flow of traffic. Another example is the choice pseudo state. This indicates that a Boolean 
condition determines a transition. 


A state transition system consists of four elements; they are as follows: 


$S = \{s_1, s_2, \ldots\}$: A set of states 
$A = \{a_1, a_2, \ldots\}$: A set of actions 
$E = \{e_1, e_2, \ldots\}$: A set of events 
$\gamma: S \times (A \cup E) \rightarrow 2^S$: A state transition function 


The first element, $S$, is the set of all possible states the world can be in. Actions are the things an 
agent can do to change the world. Events can happen in the world and are not under the control of an 
agent. The state transition function, $\gamma$, takes two things as input: a state of the world and an 
element of the union of actions and events. This gives us all the possible states as a result of applying 
a particular action or event. 


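Consider that we have a warehouse that stocks three items, and that it stocks, at most, one of each 
item. We can represent the possible states of the warehouse by the following matrix, in which each 
column is one state and each row records whether the corresponding item is in stock (the order of the 
columns is not significant): 


$$S = \begin{bmatrix} 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\ 0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \end{bmatrix}$$ 


We can define similar binary matrices for $E$, representing the event sold, and $A$, representing the 
action order. 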


In this simple example, our transition function is applied to an instance $s$ (a column in $S$) and is 
given by $s' = s + a - e$, where $s'$ is the system's final state, $s$ is its initial state, and $a$ and $e$ 
are an action and an event respectively. 
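

The warehouse transition can be tried out directly with NumPy vectors. This is a minimal sketch under stated assumptions: the particular state, action, and event vectors, the function name transition, and the validity check are illustrative choices and not part of the original example.

```python
import numpy as np

# Each state is a binary vector: entry i is 1 if item i is in stock.
s = np.array([1, 0, 1])   # items 0 and 2 are in stock
a = np.array([0, 1, 0])   # action: order item 1
e = np.array([1, 0, 0])   # event: item 0 is sold


def transition(s, a, e):
    """Apply s' = s + a - e and check that the result is still a valid state."""
    s_next = s + a - e
    if np.any((s_next < 0) | (s_next > 1)):
        raise ValueError("transition leaves the set of valid states")
    return s_next


s_next = transition(s, a, e)
print(s_next)  # [0 1 1]
```

The check reflects the constraint that the warehouse holds at most one of each item, so ordering an item that is already in stock, or selling one that is not, is rejected.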


We can represent this with the following transition diagram: 


The figure of a transition diagram 


Summary 


So far, we have introduced a broad cross-section of machine learning problems, techniques, and 
concepts. Hopefully by now, you have an idea of how to begin tackling a new and unique problem by 
breaking it up into its components. We have reviewed some of the essential mathematics and explored 
ways to visualize our designs. We can see that the same problem can have many different 
representations, and that each one may highlight different aspects. Before we can begin modeling, we 
need a well-defined objective, phrased as a specific, feasible, and meaningful question. We need to 
be clear how we can phrase the question in a way that a machine can understand. 


The design process, although consisting of different and distinct activities, is not necessarily a linear 
process, but rather more of an iterative one. We cycle through each particular phase, proposing and 
testing ideas until we feel we can jump to the next phase. Sometimes we may jump back to a previous 
stage. We may sit at an equilibrium point, waiting for a particular event to occur; we may cycle 
through stages or go through several stages in parallel. 


In the next chapter, we will begin our exploration of the practical tools that are available in the 
various Python libraries. 




Chapter 2. Tools and Techniques 


Python comes equipped with a large library of packages for machine learning tasks. 


The packages we will look at in this chapter are as follows: 


• The IPython console 
• NumPy, which is an extension that adds support for multi-dimensional arrays, matrices, and 
high-level mathematical functions 
• SciPy, which is a library of scientific formulae, constants, and mathematical functions 
• Matplotlib, which is for creating plots 
• Scikit-learn, which is a library for machine learning tasks such as classification, regression, and 
clustering 


There is only enough space to give you a flavor of these huge libraries, and an important skill is being 
able to find and understand the reference material for the various packages. It is impossible to present 
all the different functionality in a tutorial style documentation, and it is important to be able to find 
your way around the sometimes dense API references. A thing to remember is that the majority of 
these packages are put together by the open source community. They are not monolithic structures like 
you would expect from a commercial product, and therefore, understanding the various package 
taxonomies can be confusing. However, the diversity of approaches of open source software, and the 
fact that ideas are being contributed continually, give it an important advantage. 


However, the evolving nature of open source software has its downside, especially for ML 
applications. For example, there was considerable reluctance on behalf of the Python machine 
learning user community to move from Python 2 to 3. Because Python 3 broke backwards 
compatibility, importantly in terms of its numerical handling, it was not a trivial process to update the 
relevant packages. At the time of writing, all of the important (well, important for me!) packages, and 
all those used in this book, were working with Python 2.7 or 3.x. The major distributions of Python 
have Python 3 versions with a slightly different package set. 


Python for machine learning 


Python is a versatile general purpose programming language. It is an interpreted language and can run 
interactively from a console. It does not require a compiler like C++ or Java, so the development 
time tends to be shorter. It is available for free download and can be installed on many different 
operating systems including UNIX, Windows, and Macintosh. It is especially popular for scientific 
and mathematical applications. Python is relatively easy to learn compared to languages such as C++ 
and Java, with similar tasks using fewer lines of code. 


Python is not the only platform for machine learning, but it is certainly one of the most used. One of its 
major alternatives is R. Like Python, it is open source, and while it is popular for applied machine 
learning, it lacks the large development community of Python. R is a specialized tool for machine 
learning and statistical analysis. Python is a general-purpose, widely-used programming language that 
also has excellent libraries for machine learning applications. 


Another alternative is Matlab. Unlike R and Python, it is a commercial product. As would be 
expected, it contains a polished user interface and exhaustive documentation. Like R, however, it 
lacks the versatility of Python. Python is such an incredibly useful language that your effort to learn it, 
compared to the other platforms, will provide far greater pay-offs. It also has excellent libraries for 
network, web development, and microcontroller programming. These applications can complement or 
enhance your work in machine learning, all without the pain of clumsy integrations and the learning or 
remembering of the specifics of different languages. 


IPython console 


The IPython package has had some significant changes with the release of version 4. Formerly a 
monolithic package, it has been split into sub-packages, and several IPython components have 
become separate projects. Most of the repositories have been moved to the Jupyter project 
(jupyter.org). 


At the core of IPython is the IPython console: a powerful interactive interpreter that allows you to test 
your ideas in a very fast and intuitive way. Instead of having to create, save, and run a file every time 
you want to test a code snippet, you can simply type it into a console. A powerful feature of IPython is 
that it decouples the traditional read-evaluate-print loop that most computing platforms are based on. 
IPython puts the evaluate phase into its own process: a kernel (not to be confused with the kernel 
function used in machine learning algorithms). Importantly, more than one client can access the kernel. 
This means you can run code in a number of files and access them, for example, running a method 
from the console. Also, the kernel and the client do not need to be on the same machine. This has 
powerful implications for distributed and networked computing. 


The IPython console adds command-line features, such as tab completion and magic commands, 
which replicate terminal commands. If you are not using a distribution of Python with IPython already 
installed, you will need to install it first; you can then start IPython by typing ipython at your 
system's command line. Typing %quickref into the IPython console will give you a list of commands 
and their function. 


The IPython notebook should also be mentioned. The notebook has merged into another project known 
as Jupyter (jupyter.org). This web application is a powerful platform for numerical computing in over 
40 languages. The notebook allows you to share and collaborate on live code and publish rich 
graphics and text. 


Installing the SciPy stack 


The SciPy stack consists of Python along with the most commonly used scientific, mathematical, and 
ML libraries (visit scipy.org). These include NumPy, Matplotlib, the SciPy library itself, and 
IPython. The packages can be installed individually on top of an existing Python installation, or as a 
complete distribution (distro). If you do not yet have Python installed on your computer, the easiest 
way to get started is to use a distro. The major Python distributions are available for most platforms, 
and they contain everything you need in one package. Installing all the packages and their 
dependencies separately does take some time, but it may be an option if you already have a 
configured Python installation on your machine. 


Most distributions give you all the tools you need, and many come with powerful developer 
environments. Two of the best are Anaconda (www.continuum.io/downloads) and Canopy 
(http://www.enthought.com/products/canopy/). Both have free and commercial versions. For 
reference, I will be using the Anaconda distribution of Python. 


Installing the major distributions is usually a pretty painless task. 


Tip 


Be aware that not all distributions include the same set of Python modules, and you may have to 
install modules, or reinstall the correct version of a module. 


NumPy 


We should know that there is a hierarchy of types for representing data in Python. At the root are 
immutable objects such as integers, floats, and Booleans. Built on this, we have sequence types. These 
are ordered sets of objects indexed by non-negative integers. They are iterable objects that include 
strings, lists, and tuples. Sequence types have a common set of operations such as returning an 
element (s[i]) or a slice (s[i:j]), and finding the length (len(s)) or the sum (sum(s)). Finally, we have 
mapping types. These are collections of objects indexed by another collection of key objects. 
Mapping objects are unordered and are indexed by numbers, strings, or other objects. The built-in 
Python mapping type is the dictionary. 
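

The common sequence operations and the dictionary mapping type can be tried in a console. The variable names and values below are illustrative only.

```python
s = [3, 1, 4, 1, 5]                    # a list is a sequence type
print(s[0], s[1:3], len(s), sum(s))    # 3 [1, 4] 5 14

t = (3, 1, 4)                          # tuples support the same indexing and slicing
print(t[-1], t[:2])                    # 4 (3, 1)

d = {"mean": 2.8, "n": 5}              # a dictionary maps key objects to values
print(d["n"])                          # 5
```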


NumPy builds on these data objects by providing two further objects: an N-dimensional array object 
(ndarray) and a universal function object (ufunc). The ufunc object provides element-by-element 
operations on ndarray objects, allowing typecasting and array broadcasting. Typecasting is the 
process of changing one data type into another, and broadcasting describes how arrays of different 
sizes are treated during arithmetic operations. There are sub-packages for linear algebra (linalg), 
random number generation (random), discrete Fourier transforms (fft), and unit testing (testing). 


NumPy uses a dtype object to describe various aspects of the data. This includes types of data such 
as float, integer, and so on, the number of bytes in the data type (if the data is structured), and also, the 
names of the fields and the shape of any sub arrays. NumPy has several new data types, including the 
following: 


• 8, 16, 32, and 64 bit int values 
• 16, 32, and 64 bit float values 
• 64 and 128 bit complex types 
• ndarray structured array types 


We can convert between types using the np.cast object. This is simply a dictionary that is keyed 
according to destination cast type, and whose value is the appropriate function to perform the casting. 
(np.cast was removed in NumPy 2.0; on recent releases, np.float32(2) or the astype() method does 
the same job.) Here we cast an integer to a float32: 


f = np.cast['f'](2) 
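

As a hedged alternative that does not depend on np.cast, the same conversions can be written with a scalar type constructor, astype(), or the dtype argument of np.asarray(). The variable names are illustrative.

```python
import numpy as np

f = np.float32(2)                          # cast a Python int to a float32 scalar
ints = np.array([1, 2, 3])
floats = ints.astype(np.float32)           # astype() returns a converted copy
also = np.asarray([1, 2, 3], dtype="f")    # "f" is the type code for float32

print(f.dtype, floats.dtype, also.dtype)   # float32 float32 float32
```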


NumPy arrays can be created in several ways, such as converting them from other Python data 
structures, using the built-in array creation functions such as arange(), ones(), and zeros(), or from 
files such as .csv or .html. 


Indexing and slicing 


NumPy builds on the slicing and indexing techniques used in sequences. You 
should already be familiar with slicing sequences, such as lists and tuples, in Python using the 
[i:j:k] syntax, where i is the start index, j is the end, and k is the step. NumPy extends this concept 
of the selection tuple to N-dimensions. 
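

To make the [i:j:k] syntax concrete, the sketch below slices a plain list and then applies one slice per axis to a three-dimensional array. The names seq and sub are illustrative.

```python
import numpy as np

seq = list(range(10))
print(seq[1:8:2])              # [1, 3, 5, 7]: start 1, stop before 8, step 2

a = np.arange(60).reshape(3, 4, 5)
sub = a[0:2, ::2, 1:4]         # one slice per axis, separated by commas
print(sub.shape)               # (2, 2, 3)
print(sub[0, 0])               # [1 2 3]
```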


Fire up a Python console and type the following commands: 


import numpy as np 
a=np.arange(60).reshape(3,4,5) 
print(a) 


This prints a 3 by 4 by 5 array containing the integers 0 to 59. You should know that we can access 
each item in the array using a notation such as a[2,3,4]. This returns 59. Remember that indexing 
begins at 0. 


We can use the slicing technique to return a slice of the array. 


The following shows the a[1:2] array: 


array([[[20, 21, 22, 23, 24], 
        [25, 26, 27, 28, 29], 
        [30, 31, 32, 33, 34], 
        [35, 36, 37, 38, 39]]]) 





Using the ellipsis (...), we can select any remaining unspecified dimensions. For example, a[...,1] 
is equivalent to a[:,:,1]: 


array([[ 1,  6, 11, 16], 
       [21, 26, 31, 36], 
       [41, 46, 51, 56]]) 


You can also use negative numbers to count from the end of the axis: 


In [5]: a[-1:,:,-5] 
Out[5]: array([[40, 45, 50, 55]]) 


With slicing, we are creating views; the original array remains untouched, and the view retains a 
reference to the original array. This means that when we create a slice, even though we assign it to a 
new variable, if we change the original array, these changes are also reflected in the new array. The 
following demonstrates this: 


In [6]: b=a[2,2,0:2] 
In [7]: b 
Out[7]: array([50, 51]) 
In [8]: a[2]=0 #changing a changes b 
In [9]: b 
Out[9]: array([0, 0]) 


Here, b is a view onto the data held by a, so when we assign values in a, this is also reflected in b. 
To copy an array rather than simply make a reference to it, we use the deepcopy() function from the 
copy package in the standard library: 


import copy 
c=copy.deepcopy(a) 


Here, we have created a new independent array, c. Any changes made in array a will not be reflected 
in array c. 
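

The difference between a view and a copy can be checked directly. The sketch below uses the array's own copy() method, which is an alternative to the deepcopy() call shown above, and np.shares_memory() to confirm which objects share storage. The names view and snapshot are illustrative.

```python
import numpy as np

a = np.arange(60).reshape(3, 4, 5)
view = a[2, 2, 0:2]               # slicing returns a view onto a's data
snapshot = a[2, 2, 0:2].copy()    # copy() allocates independent storage

a[2] = 0                          # overwrite part of the original array
print(view)                       # [0 0]
print(snapshot)                   # [50 51]
print(np.shares_memory(a, view), np.shares_memory(a, snapshot))  # True False
```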


Constructing and transforming arrays 


This slicing functionality can also be used with several NumPy classes as an efficient means of 
constructing arrays. The numpy.mgrid object, for example, creates a meshgrid object, which 
provides, in certain circumstances, a more convenient alternative to arange(). Its primary purpose is 
to build a coordinate array for a specified N-dimensional volume. Refer to the following as an 
example: 

In [10]: np.mgrid[0:4,0:4] 
Out[10]: 
array([[[0, 0, 0, 0], 
        [1, 1, 1, 1], 
        [2, 2, 2, 2], 
        [3, 3, 3, 3]], 

       [[0, 1, 2, 3], 
        [0, 1, 2, 3], 
        [0, 1, 2, 3], 
        [0, 1, 2, 3]]]) 





Sometimes, we will need to manipulate our data structures in other ways. These include: 


• Concatenating: By using the np.r_ and np.c_ objects, we can concatenate along one or two 
axes using the slicing constructs. Here is an example: 

In [11]: np.r_[-1:1:5j, 2] 
Out[11]: array([-1. , -0.5,  0. ,  0.5,  1. ,  2. ]) 

Here we have used the complex number 5j as the step size, which is interpreted by NumPy as the 
number of points, inclusive, to fit between the specified range, which here is -1 to 1. The trailing 2 
is then appended to those five points. 
• newaxis: This object expands the dimensions of an array: 

In [12]: a[np.newaxis,:,:].shape 
Out[12]: (1, 3, 4, 5) 

This creates an extra axis in the first dimension. The following creates the new axis in the 
second dimension: 

In [13]: a[:,np.newaxis,:].shape 
Out[13]: (3, 1, 4, 5) 

You can also use a Boolean operator to filter: 

a[a<5] 
Out[]: array([0, 1, 2, 3, 4]) 


• Find the sum of a given axis: 

In [14]: a.sum(2) 
Out[14]: 
array([[ 10,  35,  60,  85], 
       [110, 135, 160, 185], 
       [  0,   0,   0,   0]]) 

Here we have summed using axis 2. The last row is zero because we set a[2] to 0 earlier. 
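

The manipulations in the list above can be run together on a fresh array. This is a sketch for experimentation; the names row and cols are illustrative, and the array a is recreated here so that the results do not depend on the earlier assignment a[2]=0.

```python
import numpy as np

a = np.arange(60).reshape(3, 4, 5)

row = np.r_[-1:1:5j, 2]               # five points from -1 to 1, then the value 2
cols = np.c_[[1, 2, 3], [4, 5, 6]]    # stack two sequences as columns
print(row.shape, cols.shape)          # (6,) (3, 2)

print(a[:, np.newaxis].shape)         # (3, 1, 4, 5)
print(a[a < 5])                       # [0 1 2 3 4]
print(a.sum(axis=2).shape)            # (3, 4)
```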


Mathematical operations 


As you would expect, you can perform mathematical operations such as addition, subtraction, 
multiplication, as well as the trigonometric functions on NumPy arrays. Arithmetic operations on 
different shaped arrays can be carried out by a process known as broadcasting. When operating on 
two arrays, NumPy compares their shapes dimension by dimension, starting from the trailing 
dimension. Two dimensions are compatible if they are the same size, or if one of them is 1. If these 
conditions are not met, then a ValueError exception is raised. 
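

The compatibility rule can be seen with three small arrays. This is an illustrative sketch; the shapes are chosen only to show one compatible case of each kind and one failure.

```python
import numpy as np

a = np.ones((3, 4, 5))
b = np.arange(5)                  # shape (5,): lines up with the last axis of a
c = np.arange(4).reshape(4, 1)    # shape (4, 1): the size-1 axis is stretched to 5

print((a + b).shape)              # (3, 4, 5)
print((a + c).shape)              # (3, 4, 5)

try:
    a + np.arange(3)              # trailing sizes 5 and 3 differ, and neither is 1
    compatible = True
except ValueError:
    compatible = False
print(compatible)                 # False
```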


This is all done in the background using ufunc objects. A ufunc operates on ndarrays on an 
element-by-element basis. Ufuncs are essentially wrappers that provide a consistent interface to 
scalar functions to allow them to work with NumPy arrays. There are over 60 ufunc objects covering 
a wide variety of operations and types. The ufunc objects are called automatically when you perform 
operations such as adding two arrays using the + operator. 


Let's look into some additional mathematical features: 


• Vectors: We can also create our own vectorized versions of scalar functions using the 
np.vectorize() function. It takes a Python scalar function or method as a parameter and 
returns a vectorized version of this function: 

def myfunc(a,b): 
    if a > b: 
        return a-b 
    else: 
        return a+b 
vfunc=np.vectorize(myfunc) 

We will observe the following output: 

In [18]: vfunc([1,2,3,4],[4,3,2,1]) 
Out[18]: array([5, 5, 1, 3]) 





• Polynomial functions: The poly1d class allows us to deal with polynomial functions in a 
natural way. It accepts as a parameter an array of coefficients in decreasing powers. For 
example, the polynomial 2x^2 + 3x + 4 can be entered by the following: 

In [27]: p=np.poly1d([2,3,4]) 
In [28]: print(np.poly1d(p)) 
   2 
2 x + 3 x + 4 

We can see that it prints out the polynomial in a human-readable way. We can perform various 
operations on the polynomial, such as evaluating at a point: 

In [29]: p(3) 
Out[29]: 31 


• Find the roots: 

In [30]: p.r 
Out[30]: array([-0.75+1.19895788j, -0.75-1.19895788j]) 


We can use np.asarray(p) to get the coefficients of the polynomial as an array so that it can be used in 
all functions that accept arrays. 
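

A short sketch ties these polynomial operations together. The derivative call, deriv(), is an extra poly1d method that the text above does not cover; it is included here only as a further illustration.

```python
import numpy as np

p = np.poly1d([2, 3, 4])            # 2x^2 + 3x + 4
coeffs = np.asarray(p)              # coefficients as a plain ndarray
print(coeffs)                       # [2 3 4]
print(p(3))                         # 31

dp = p.deriv()                      # derivative, 4x + 3
print(dp.coeffs)                    # [4 3]

# Evaluating the polynomial at its own roots should give (numerically) zero.
print(np.allclose(p(p.r), 0))       # True
```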


As we will see, the packages that are built on NumPy give us a powerful and flexible framework for 
machine learning. 


Matplotlib 


Matplotlib, or more importantly, its sub-package PyPlot, is an essential tool for visualizing two- 
dimensional data in Python. I will only mention it briefly here because its use should become 
apparent as we work through the examples. It is built to work like Matlab with command style 
functions. Each PyPlot function makes some change to a PyPlot instance. At the core of PyPlot is 
the plot method. The simplest implementation is to pass plot a list or a 1D array. If only one 
argument is passed to plot, it assumes it is a sequence of y values, and it will automatically generate 
the x values. More commonly, we pass plot two 1D arrays or lists for the co-ordinates x and y. The 
plot method can also accept an argument to indicate line properties such as line width, color, and 
style. Here is an example: 


import numpy as np 
import matplotlib.pyplot as plt 


x = np.arange(0., 5., 0.2) 
plt.plot(x, x**4, 'r', x, x*90, 'bs', x, x**3, 'g^') 
plt.show() 


This code plots three lines in different styles: a red line, blue squares, and green triangles. Notice 
that we can pass more than one set of coordinate arrays and format strings to plot multiple lines. For 
a full list of line styles, type help(plt.plot). 


Pyplot, like Matlab, applies plotting commands to the current axes. Multiple axes can be created 
using the subplot command. Here is an example: 


x1 = np.arange(0., 5., 0.2) 
x2 = np.arange(0., 5., 0.1) 


plt.figure(1) 
plt.subplot(211) 
plt.plot(x1, x1**4, 'r', x1, x1*90, 'bs', x1, x1**3, 'g^', linewidth=2.0) 


plt.subplot(212) 
plt.plot(x2, np.cos(2*np.pi*x2), 'k') 
plt.show() 


The output of the preceding code is as follows: 


[Figure: two stacked subplots produced by the preceding code] 


Another useful plot is the histogram. The hist() function takes an array, or a sequence of arrays, of 
input values. The second parameter is the number of bins. In this example, we have divided a 
distribution into 10 bins. The density parameter (called normed in older Matplotlib releases), when 
set to True, normalizes the counts to form a probability density. Notice also that in this code, we have 
labeled the x and y axis, and displayed a title and some text at a location given by the coordinates: 


mu, sigma = 100, 15 
x = mu + sigma * np.random.randn(1000) 
# density=True replaces the normed=1 argument used by older Matplotlib releases 
n, bins, patches = plt.hist(x, 10, density=True, facecolor='g') 
plt.xlabel('Frequency') 
plt.ylabel('Probability') 
plt.title('Histogram Example') 
plt.text(40, .028, 'mean=100 std.dev.=15') 
plt.axis([40, 160, 0, 0.03]) 
plt.grid(True) 
plt.show() 


The output for this code will look like this: 


[Figure: histogram titled "Histogram Example", annotated "mean=100 std.dev.=15", with the x axis 
labeled Frequency and the y axis labeled Probability] 


The final 2D plot we are going to look at is the scatter plot. The scatter function takes two sequence 
objects, such as arrays, of the same length and optional parameters to denote color and style 
attributes. Let's take a look at this code: 


N = 100 
x = np.random.rand(N) 
y = np.random.rand(N) 
# One color value per point; a short tuple such as ('r','b','g') is rejected by 
# recent Matplotlib releases when its length does not match the number of points 
colors = np.random.rand(N) 
area = np.pi * (10 * np.random.rand(N))**2 # 0 to 10 point radiuses 
plt.scatter(x, y, s=area, c=colors, alpha=0.5) 
plt.show() 


We will observe the following output: 


[Figure: scatter plot produced by the preceding code] 


Matplotlib also has a powerful toolbox for rendering 3D plots. The following code demonstrations 
are simple examples of 3D line, scatter, and surface plots. 3D plots are created in a very similar way 
to 2D plots. Here, we create the axes with the projection parameter set to 3d (older Matplotlib 
releases did this by passing the parameter to the gca function). All the plotting methods work much 
like their 2D counterparts, except that they now take a third set of input values for the z axis: 


import matplotlib as mpl 
from mpl_toolkits.mplot3d import Axes3D 
import numpy as np 
import matplotlib.pyplot as plt 
from matplotlib import cm 


mpl.rcParams['legend.fontsize'] = 10 


fig = plt.figure() 
# Older Matplotlib releases: ax = fig.gca(projection='3d') 
ax = fig.add_subplot(111, projection='3d') 
theta = np.linspace(-3 * np.pi, 6 * np.pi, 100) 
z = np.linspace(-2, 2, 100) 
r = z**2 + 1 
x = r * np.sin(theta) 
y = r * np.cos(theta) 
ax.plot(x, y, z) 


theta2 = np.linspace(-3 * np.pi, 6 * np.pi, 20) 
z2 = np.linspace(-2, 2, 20) 
r2 = z2**2 + 1 
x2 = r2 * np.sin(theta2) 
y2 = r2 * np.cos(theta2) 


ax.scatter(x2, y2, z2, c='r') 
x3 = np.arange(-5, 5, 0.25) 
y3 = np.arange(-5, 5, 0.25) 
x3, y3 = np.meshgrid(x3, y3) 
R = np.sqrt(x3**2 + y3**2) 
z3 = np.sin(R) 
surf = ax.plot_surface(x3, y3, z3, rstride=1, cstride=1, cmap=cm.Greys_r, 
                       linewidth=0, antialiased=False) 
ax.set_zlim(-2, 2) 
plt.show() 


We will observe this output: 




Pandas 


The Pandas library builds on NumPy by introducing several useful data structures and functionalities 
to read and process data. Pandas is a great tool for general data munging. It easily handles common 
tasks such as dealing with missing data, manipulating shapes and sizes, converting between data 
formats and structures, and importing data from different sources. 


The main data structures introduced by Pandas are: 


• Series 
• The DataFrame 
• Panel (deprecated and later removed in more recent Pandas releases) 


The DataFrame is probably the most widely used. It is a two-dimensional structure that is effectively 
a table created from either a NumPy array, lists, dicts, or series. You can also create a DataFrame by 
reading from a file. 
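

A brief sketch of the construction routes just listed. The column names and values are made up for illustration and are not taken from the weather data used below.

```python
import numpy as np
import pandas as pd

# From a dict of lists: keys become column labels.
from_dict = pd.DataFrame({"year": [2001, 2002], "maxtemp": [17.1, 16.8]})

# From a NumPy array: column labels are supplied separately.
from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])

# From Series of different lengths: rows are aligned on the index,
# and the gap is filled with a missing value (NaN).
from_series = pd.DataFrame({"a": pd.Series([1.0, 2.0]), "b": pd.Series([3.0])})

print(from_dict.shape, from_array.shape)     # (2, 2) (3, 2)
print(from_series.isna().sum().sum())        # 1
```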


Probably the best way to get a feel for Pandas is to go through a typical use case. Let's say that we are 
given the task of discovering how the daily maximum temperature has changed over time. For this 
example, we will be working with historical weather observations from the Hobart weather station in 
Tasmania. Download the following ZIP file and extract its contents into a folder called data in your 
Python working directory: 


http://davejulian.net/mlbook/data 


The first thing we do is create a DataFrame from it: 


import pandas as pd 
df=pd.read_csv('data/sampleData.csv') 


Check the first few rows in this data: 


df.head() 


We can see that the product code and the station number are the same for each row and that this 
information is superfluous. Also, the days of accumulated maximum temperature are not needed for 
our purpose, so we will delete them as well: 


del df['Bureau of Meteorology station number'] 
del df['Product code'] 
del df['Days of accumulation of maximum temperature'] 


Let's make our data a little easier to read by shortening the column labels: 


df=df.rename(columns={'Maximum temperature (Degree C)':'maxtemp'}) 


We are only interested in data that is of high quality, so we include only records that have a Y in the 
quality column: 


df=df[(df.Quality=='Y')] 
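

If you do not have the data file to hand, the same cleaning steps can be tried on a small stand-in table. The three rows below are invented purely for illustration; only the column labels follow the ones used above, and drop() is used here as an alternative to del.

```python
import pandas as pd

# Hypothetical miniature of the weather table (values are made up).
df = pd.DataFrame({
    "Product code": ["X", "X", "X"],
    "Year": [1990, 1990, 1991],
    "Maximum temperature (Degree C)": [17.2, None, 21.4],
    "Quality": ["Y", "N", "Y"],
})

df = df.drop(columns=["Product code"])                            # remove a superfluous column
df = df.rename(columns={"Maximum temperature (Degree C)": "maxtemp"})
df = df[df.Quality == "Y"]                                        # keep only high quality rows

print(df)
print(df.maxtemp.mean())
```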


We can get a statistical summary of our data: 


df.describe() 


[Output: a summary table giving the count, mean, standard deviation, minimum, quartiles, and 
maximum of the Year, Month, Day, and maxtemp columns] 





If we import the matplotlib.pyplot package, we can graph the data: 


import matplotlib.pyplot as plt 
plt.plot(df.Year, df.maxtemp) 


[Figure: maxtemp plotted against Year; the x-axis runs from 1900 to 2020]





Notice that PyPlot correctly formats the date axis and deals with the missing data by connecting the 
two known points on either side. We can convert a DataFrame into a NumPy array using the 
following: 


ndarray = df.values 


If the DataFrame contains a mixture of data types, then this conversion will cast them to the lowest
common denominator type, which means that the one that accommodates all values will be chosen.
For example, if the DataFrame consists of a mix of float16 and float32 types, then the values will be
converted to float32.


The Pandas DataFrame is a great object for viewing and manipulating simple text and numerical data.
However, Pandas is probably not the right tool for more sophisticated numerical processing such as
calculating the dot product, or finding the solutions to linear systems. For numerical applications, we
generally use the NumPy classes.
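The type conversion described above can be checked with a short sketch. The two small frames here are hypothetical and exist only to show which dtype the resulting NumPy array ends up with.

```python
import numpy as np
import pandas as pd

# Two float columns of different precision.
df = pd.DataFrame({
    "a": np.array([1.5, 2.5], dtype=np.float16),
    "b": np.array([3.5, 4.5], dtype=np.float32),
})
values = df.values
print(values.dtype)  # the wider of the two float types

# Numbers mixed with text have no common numeric type,
# so the array falls back to generic Python objects.
mixed = pd.DataFrame({"n": [1, 2], "label": ["x", "y"]})
print(mixed.values.dtype)
```

Once the data is in a NumPy array of a single numeric type, operations such as dot products can be applied to it directly.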

SciPy


SciPy (pronounced sigh pie) adds a layer to NumPy that wraps common scientific and statistical
applications on top of the more purely mathematical constructs of NumPy. SciPy provides higher-level
functions for manipulating and visualizing data, and it is especially useful when using Python
interactively. SciPy is organized into sub-packages covering different scientific computing
applications. A list of the packages most relevant to ML and their functions appears as follows:


Package | Description
cluster | This contains two sub-packages: cluster.vq for K-means clustering and vector quantization, and cluster.hierarchy for hierarchical and agglomerative clustering, which is useful for distance matrices, calculating statistics on clusters, as well as visualizing clusters with dendrograms.
constants | These are physical and mathematical constants such as pi and e.
integrate | These are differential equation solvers.
interpolate | These are interpolation functions for creating new data points within a range of known points.
io | This refers to input and output functions for creating string, binary, or raw data streams, and reading and writing to and from files.
optimize | This refers to optimizing and finding roots.
linalg | This refers to linear algebra routines such as basic matrix calculations, solving linear systems, finding determinants and norms, and decomposition.
ndimage | This is N-dimensional image processing.
odr | This is orthogonal distance regression.
stats | This refers to statistical distributions and functions.


Many of the NumPy modules have the same name and similar functionality as those in the SciPy 
package. For the most part, SciPy imports its NumPy equivalent and extends its functionality. 
However, be aware that some identically named functions in SciPy modules may have slightly 
different functionality compared to those in NumPy. It also should be mentioned that many of the 
SciPy classes have convenience wrappers in the scikit-learn package, and it is sometimes easier to
use those instead.


Each of these packages requires an explicit import; here is an example:


import scipy.cluster 


You can get documentation from the SciPy website (scipy.org) or from the console, for example,
help(scipy.cluster).


As we have seen, a common task in many different ML settings is that of optimization. We looked at
the mathematics of the simplex algorithm in the last chapter. Here is the implementation using SciPy.
Recall that simplex optimizes a linear objective subject to a set of linear constraints. The problem we
looked at was as follows:


Maximize $x_1 + x_2$ within the constraints of: $2x_1 + x_2 \le 4$ and $x_1 + 2x_2 \le 3$


The linprog function is probably the simplest tool that will solve this problem. It is a minimization
algorithm, so we reverse the sign of our objective.


from scipy.optimize import linprog

objective = [-1, -1]
con1 = [[2, 1], [1, 2]]
con2 = [4, 3]
res = linprog(objective, con1, con2)
print(res)


You will observe the following output: 


nit: 2
message: 'Optimization terminated successfully.'
status: 0
x: array([ 1.66666667, 0.66666667])
success: True
fun: -2.3333333333333335
slack: array([ 0., 0.])
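The same problem can be written with linprog's keyword arguments, which makes the role of each list explicit. This is a sketch, not the only way to set the problem up; the variable names are illustrative, and the explicit non-negativity bounds simply restate linprog's default.

```python
from scipy.optimize import linprog

c = [-1, -1]                 # minimize -(x1 + x2), i.e. maximize x1 + x2
A_ub = [[2, 1], [1, 2]]      # left-hand sides of the <= constraints
b_ub = [4, 3]                # right-hand sides of the <= constraints

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])

# Undo the sign flip to recover the maximized objective value.
max_value = -res.fun
print(res.x, max_value)
```

Both constraints are tight at the optimum, which is why the slack values reported by the solver are zero.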





There is also an optimize.minimize function that is suitable for slightly more complicated
problems. This function takes a solver as a parameter. There are currently about a dozen solvers
available, and if you need a more specific solver, you can write your own. The most commonly used,
and suitable for most problems, is the nelder-mead solver. This particular solver uses a downhill
simplex algorithm that is basically a heuristic search: it repeatedly replaces the test point with the
highest error with a new point found by reflecting it through the centroid of the remaining points. It
iterates through this process until it converges on a minimum.


In this example, we use the Rosenbrock function as our test problem. This is a non-convex function
that is often used to test optimization algorithms. The global minimum of this function lies in a long,
parabolic valley, and this makes it challenging for an algorithm to find the minimum in a large,
relatively flat valley. We will see more of this function:


import numpy as np
from scipy.optimize import minimize

def rosen(x):
    # Rosenbrock function of an N-dimensional point x
    return sum(100.0 * (x[1:] - x[:-1]**2.0)**2.0 + (1 - x[:-1])**2.0)

def nMin(funct, x0):
    # 'xatol' is the tolerance option in recent SciPy releases ('xtol' in older ones)
    return minimize(funct, x0, method='nelder-mead',
                    options={'xatol': 1e-8, 'disp': True})

x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])
nMin(rosen, x0)


The output for the preceding code is as follows:


Optimization terminated successfully.
Current function value: 0.000000
Iterations: 339
Function evaluations: 571





The minimize function takes two mandatory parameters. These are the objective function and the
initial value of x0. The minimize function also takes an optional parameter for the solver method; in
this example, we use the nelder-mead method. The options are a solver-specific set of key-value
pairs, represented as a dictionary. Here, xtol is the relative error acceptable for convergence (recent
SciPy releases replace it with xatol, an absolute tolerance), and disp is set to print a message.


Another package that is extremely useful for machine learning applications is scipy.linalg. This
package adds the ability to perform tasks such as inverting matrices, calculating eigenvalues, and
matrix decomposition.
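A brief sketch of the scipy.linalg tasks just listed follows. The 2 x 2 matrix and the right-hand side vector are arbitrary values chosen for illustration.

```python
import numpy as np
from scipy import linalg

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
b = np.array([1.0, 2.0])

A_inv = linalg.inv(A)                      # matrix inverse
eigenvalues, eigenvectors = linalg.eig(A)  # eigen-decomposition
P, L, U = linalg.lu(A)                     # LU decomposition: A = P @ L @ U
x = linalg.solve(A, b)                     # solves A @ x = b without forming A_inv

print(eigenvalues.real)
print(x)
```

For solving linear systems, calling solve is generally preferable to multiplying by an explicit inverse, since it avoids forming the inverse at all.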

Scikit-learn


This includes algorithms for the most common machine learning tasks, such as classification, 
regression, clustering, dimensionality reduction, model selection, and preprocessing. 


Scikit-learn comes with several real-world data sets for us to practice with. Let's take a look at one 
of these—the Iris data set: 


from sklearn import datasets
iris = datasets.load_iris()
iris_X = iris.data
iris_y = iris.target
iris_X.shape

(150, 4)


The data set contains 150 samples of three types of irises (Setosa, Versicolor, and Virginica), each
with four features. We can get a description of the dataset:


iris.DESCR


We can see that the four attributes, or features, are sepal length, sepal width, petal length, and petal
width in centimeters. Each sample is associated with one of three classes: Setosa, Versicolor, and
Virginica. These are represented by 0, 1, and 2 respectively.


Let's look at a simple classification problem using this data. We want to predict the type of iris based
on its features: the length and width of its sepal and petals. Typically, scikit-learn uses estimators that
implement a fit(X, y) method for training a classifier, and a predict(X) method that, if given
unlabeled observations, X, returns the predicted labels, y. The fit() and predict() methods usually
take a 2D array-like object.


Here, we are going to use the K Nearest Neighbors (K-NN) technique to solve this classification
problem. The principle behind K-NN is relatively simple. We classify an unlabeled sample according
to the classification of its nearest neighbors. Each data point is assigned class membership according
to the majority class of a small number, k, of its nearest neighbors. K-NN is an example of
instance-based learning, where classification is not done according to an inbuilt model, but with
reference to a labeled training set. The K-NN algorithm is known as non-generalizing, since it simply
remembers all its training data and compares it to each new sample. Despite, or perhaps because of,
its apparent simplicity, K-NN is a very widely used technique for solving a variety of classification
and regression problems.


There are two different K-NN classifiers in Sklearn. KNeighborsClassifier requires the user to
specify k, the number of nearest neighbors. RadiusNeighborsClassifier, on the other hand,
implements learning based on the number of neighbors within a fixed radius, r, of each training point.
KNeighborsClassifier is the more commonly used one. The optimal value for k is very much
dependent on the data. In general, a larger k is used with noisy data, the trade-off being that the
classification boundary becomes less distinct. If the data is not uniformly sampled, then
RadiusNeighborsClassifier may be a better choice. Since the number of neighbors is based on the
radius, k will be different for each point. In sparser areas, k will be lower than in areas of high
sample density:


from sklearn.neighbors import KNeighborsClassifier as knn
from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

def knnDemo(X, y, n):
    # creates the classifier and fits it to the data
    res = 0.05
    k1 = knn(n_neighbors=n, p=2, metric='minkowski')
    k1.fit(X, y)

    # sets up the grid
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, res),
                           np.arange(x2_min, x2_max, res))

    # makes the prediction for every point on the grid
    Z = k1.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)

    # creates the color maps
    cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
    cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

    # plots the decision surface
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap_light)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # plots the samples, colored by class
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
    plt.show()

iris = datasets.load_iris()
X1 = iris.data[:, 0:3:2]
X2 = iris.data[:, 0:2]
X3 = iris.data[:, 1:3]
y = iris.target
knnDemo(X2, y, 15)


Here is the output of the preceding commands:


[Figure: two panels titled "3-Class classification (k = 15, weights = 'uniform')" and "3-Class classification (k = 15, weights = 'distance')"]
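To compare the two neighbor-based classifiers numerically rather than visually, a sketch along the following lines can be used. The 30 percent hold-out, the radius of 1.0, and the outlier_label setting are assumptions made for this illustration, not values taken from the text.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# Fixed number of neighbors for every query point.
knn_clf = KNeighborsClassifier(n_neighbors=15).fit(X_train, y_train)

# Fixed radius: the number of neighbors varies with local sample density.
# outlier_label covers query points that have no training point within the radius.
rad_clf = RadiusNeighborsClassifier(
    radius=1.0, outlier_label="most_frequent").fit(X_train, y_train)

knn_acc = knn_clf.score(X_test, y_test)
rad_acc = rad_clf.score(X_test, y_test)
print("k-neighbors accuracy: %0.2f" % knn_acc)
print("radius-neighbors accuracy: %0.2f" % rad_acc)
```

The radius has to be chosen on the scale of the features, here centimeters, so a sensible value depends on the data just as k does.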





Let's now look at regression problems with Sklearn. The simplest solution is to minimize the sum of
the squared error. This is performed by the LinearRegression object. This object has a fit()
method that takes two arrays: X, the feature matrix, and y, the target vector:


from sklearn import linear_model

clf = linear_model.LinearRegression()
clf.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
clf.coef_


array([ 0.5, 0.5])


The LinearRegression object has four optional parameters:


• fit_intercept: A Boolean, which if set to false, will assume that the data is centered, and the
model will not use an intercept in its calculation. The default value is true.

• normalize: If true, X will be normalized to zero mean and unit variance before regression.
This is sometimes useful because it can make interpreting the coefficients a little more explicit.
The default is false. (This parameter has been removed in recent scikit-learn releases, where
scaling is done as a separate preprocessing step.)

• copy_X: Defaults to true. If set to false, it will allow X to be overwritten.

• n_jobs: Is the number of jobs to use for the computation. This defaults to 1. This can be used to
speed up computation for large problems on multiple CPUs.


Its output has the following attributes:


• coef_: An array of the estimated coefficients for the linear regression problem. If y is
multidimensional, that is, there are multiple target variables, then coef_ will be a 2D array of the
form (n_targets, n_features). If only one target variable is passed, then coef_ will be a 1D
array of length (n_features).

• intercept_: This is an array of the intercept or independent terms in the linear model.
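The shapes of these attributes can be checked with a small sketch. The feature matrix and the two target columns below are invented so that the first target is an exact linear function of the features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 1.0]])

# One target: y = 1 + 2*x0 - 1*x1
y_single = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]
# Two targets stacked as columns
y_multi = np.column_stack([y_single, 3.0 * X[:, 0]])

single = LinearRegression().fit(X, y_single)
multi = LinearRegression().fit(X, y_multi)

print(single.coef_.shape, single.intercept_)   # 1D coefficients, scalar intercept
print(multi.coef_.shape, multi.intercept_)     # (n_targets, n_features), one intercept per target
```

Because the targets here contain no noise, the fitted coefficients recover the generating values up to floating-point error.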


For the Ordinary Least Squares to work, we assume that the features are independent. When these
terms are correlated, then the matrix, X, can approach singularity. This means that the estimates
become highly sensitive to small changes in the input data. This is known as multicollinearity and
results in a large variance and ultimately instability. We discuss this in greater detail later, but for
now, let's look at an algorithm that, to some extent, addresses these issues.


Ridge regression not only addresses the issue of multicollinearity, but also situations where the
number of input variables greatly exceeds the number of samples. The linear_model.Ridge()
object uses what is known as L2 regularization. Intuitively, we can understand this as adding a penalty
on the extreme values of the weight vector. This is sometimes called shrinkage because it makes the
average weights smaller. This tends to make the model more stable because it reduces its sensitivity
to extreme values.


The Sklearn object, linear_model.Ridge, adds a regularization parameter, alpha. Generally, small
positive values for alpha improve the model's stability. It can either be a float or an array. If it is an
array, it is assumed that the array corresponds to specific targets, and therefore, it must be the same
size as the target. We can try this out with the following simple function:


from sklearn.linear_model import Ridge
import numpy as np

def ridgeReg(alpha):
    n_samples, n_features = 10, 5
    y = np.random.randn(n_samples)
    X = np.random.randn(n_samples, n_features)
    # pass the alpha argument through to the estimator
    clf = Ridge(alpha=alpha)
    res = clf.fit(X, y)
    return res

res = ridgeReg(0.001)
print(res.coef_)
print(res.intercept_)
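To see the shrinkage effect under multicollinearity, the following sketch compares ordinary least squares with ridge regression on two nearly identical features. The synthetic data, the random seed, and alpha=1.0 are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
x = rng.randn(50)
# Second feature is almost a copy of the first: strong multicollinearity.
X = np.column_stack([x, x + 1e-3 * rng.randn(50)])
y = 3.0 * x + 0.1 * rng.randn(50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# The L2 penalty pulls the weight vector toward zero.
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print("OLS coefficients:  ", ols.coef_, "norm:", ols_norm)
print("Ridge coefficients:", ridge.coef_, "norm:", ridge_norm)
```

With the penalty in place, the weight tends to be shared between the two correlated features instead of being split into large offsetting values.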


Let's now look at some scikit-learn algorithms for dimensionality reduction. This is important for 
machine learning because it reduces the number of input variables or features that a model has to 
consider. This makes a model more efficient and can make the results easier to interpret. It can also 
increase a model's generalization by reducing overfitting. 


It is important, of course, to not discard information that will reduce the accuracy of the model. 
Determining what is redundant or irrelevant is the major function of dimensionality reduction 
algorithms. There are basically two approaches: feature extraction and feature selection. Feature 
selection attempts to find a subset of the original feature variables. Feature extraction, on the other 
hand, creates new feature variables by combining correlated variables. 


Let's first look at probably the most common feature extraction algorithm, that is, Principal
Component Analysis or PCA. This uses an orthogonal transformation to convert a set of correlated
variables into a set of uncorrelated variables. The important information, the length of vectors, and
the angle between them does not change. This information is defined in the inner product and is
preserved in an orthogonal transformation. We construct a feature vector in such a way that the first
component accounts for as much of the variability in the data as possible. Subsequent components
then account for decreasing amounts of variability. This means that, for many models, we can just
choose the first few principal components until we are satisfied that they account for as much
variability in our data as is required by the experimental specifications.
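Choosing the first few components can be sketched with scikit-learn's PCA class and its explained_variance_ratio_ attribute. The 95 percent threshold used here is an arbitrary assumption standing in for whatever the experimental specification requires.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
pca = PCA(n_components=4).fit(iris.data)

# Fraction of total variance explained by each component, largest first.
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative share reaches the threshold.
threshold = 0.95
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

X_reduced = PCA(n_components=n_keep).fit_transform(iris.data)
print(ratios)
print(n_keep, X_reduced.shape)
```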


PCA itself is a linear technique; kernel PCA extends it to nonlinear structure by applying a kernel
function to the data. Probably the most versatile kernel function, and the one that gives good results
in most situations, is the Radial Basis Function (RBF). The rbf kernel takes a parameter, gamma,
which can be loosely interpreted as the inverse of the sphere of influence of each sample. A low value
of gamma means that each sample has a large radius of influence on samples selected by the model.
The KernelPCA fit_transform method takes the training vector, fits it to the model, and then
transforms it into its principal components. Let's look at the commands:


import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import KernelPCA
from sklearn.datasets import make_circles

np.random.seed(0)
X, y = make_circles(n_samples=400, factor=.3, noise=.05)
kpca = KernelPCA(kernel='rbf', gamma=10)
X_kpca = kpca.fit_transform(X)

# original data: two concentric circles
plt.figure()
plt.subplot(2, 2, 1, aspect='equal')
plt.title("Original space")
reds = y == 0
blues = y == 1
plt.plot(X[reds, 0], X[reds, 1], "ro")
plt.plot(X[blues, 0], X[blues, 1], "bo")
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")

# the same samples projected onto the first two kernel principal components
plt.subplot(2, 2, 3, aspect='equal')
plt.plot(X_kpca[reds, 0], X_kpca[reds, 1], "ro")
plt.plot(X_kpca[blues, 0], X_kpca[blues, 1], "bo")
plt.title("Projection by KPCA")
plt.xlabel(r"1st principal component in space induced by $\phi$")
plt.ylabel("2nd component")

plt.subplots_adjust(0.02, 0.10, 0.98, 0.94, 0.04, 0.35)
plt.show()


As we have seen, a major obstacle to the success of a supervised learning algorithm is the translation
from training data to test data. A labeled training set may have distinctive characteristics that are not
present in new unlabeled data. We have seen that we can train our model to be quite precise on
training data, yet this precision may not be translated to our unlabeled test data. Overfitting is an
important problem in supervised learning and there are many techniques you can use to minimize it. A
way to evaluate the performance of an estimator using only the training set is to use cross validation.
Let's try this out on our iris data using a support vector machine. The first thing that we need to do is
split our data into training and test sets. The train_test_split function takes two data structures:
the data itself and the target. They can be either NumPy arrays, Pandas DataFrames, lists, or SciPy
matrices. As you would expect, the target needs to be the same length as the data. The test_size
argument can either be a float between 0 and 1, representing the proportion of data included in the
test split, or an int representing the number of test samples. Here, we have used a test_size of
0.4, indicating that we are holding out 40% of our data for testing.


In this example, we use the svm.SVC class and the cross_val_score function to return the mean
accuracy of the classifier across five cross-validation folds of the training data:


from sklearn import datasets
from sklearn import svm
# older scikit-learn releases provided these in sklearn.cross_validation
from sklearn.model_selection import train_test_split, cross_val_score

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
scores = cross_val_score(clf, X_train, y_train, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


You will observe the following output: 


Accuracy: 0.99 (+/- 0.05)


Support vector machines have a penalty parameter that has to be set manually, and it is quite likely
that we will run the SVC many times and adjust this parameter until we get an optimal fit. Doing this
against the test set, however, leaks information about the test set into the model, so we may still have
the problem of overfitting. This is a problem for any estimator that has parameters that must be set
manually, and we will explore this further in Chapter 4, Models – Learning from Information.
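One common way to limit this kind of leakage is to tune the penalty parameter by cross validation on the training set only, and to touch the test set once at the end. The following sketch uses scikit-learn's GridSearchCV; the candidate values for C are arbitrary choices for illustration.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

# Each candidate C is scored by 5-fold cross validation on the training data.
search = GridSearchCV(svm.SVC(kernel='linear'), {'C': [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)
best_C = search.best_params_['C']

# The held-out test set is used only once, after C has been chosen.
test_accuracy = search.score(X_test, y_test)
print("best C:", best_C)
print("test accuracy: %0.2f" % test_accuracy)
```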

Summary


We have seen a basic kit of machine learning tools and a few indications of their uses on simple
datasets. What you may be beginning to wonder is how these tools can be applied to real-world
problems. There is considerable overlap between each of the libraries we have discussed. Many
perform the same task, but add or perform the same function in a different way. Choosing which
library to use for each problem is not necessarily a definitive decision. There is no best library; there
is only the preferred library, and this varies from person to person, and of course, to the specifics of
the application.


In the next chapter, we will look at one of the most important, and often overlooked, aspects of 
machine learning, that is, data. 

Chapter 3. Turning Data into Information


Raw data can be in many different formats and of varying quantity and quality. Sometimes, we are
overwhelmed with data, and sometimes we struggle to get every last drop of information from our
data. For data to become information, it requires some meaningful structure. We often have to deal
with incompatible formats, inconsistencies, errors, and missing data. It is important to be able to
access different parts of the dataset or extract subsets of the data based on some relational criteria.
We need to spot patterns in our data and get a feel for how the data is distributed. We can use many
tools to find this information hidden in data, from visualizations and running algorithms to just
looking at the data in a spreadsheet.


In this chapter, we are going to introduce the following broad topics: 


Big data 

Data properties 

Data sources 

Data processing and analysis 


But first, let's take a look into the following explanations: 

What is data?


Data can be stored on a hard drive, streamed through a network, or captured live through sensors such
as video cameras and microphones. If we are sampling from physical phenomena, such as a video or
sound recording, the space is continuous and effectively infinite. Once this space is sampled, that is
digitized, a finite subset of this space has been created and at least some minimal structure has been
imposed on it. The data is on a hard drive, encoded in bits, given some attributes such as a name,
creation date, and so on. Beyond this, if the data is to be made use of in an application, we need to
ask, "how is the data organized and what kinds of queries does it efficiently support?"


When faced with an unseen dataset, the first phase is exploration. Data exploration involves
examining the components and structure of data. How many samples does it contain, and how many
dimensions are in each sample? What are the data types of each dimension? We should also get a feel
for the relationships between variables and how they are distributed. We need to check whether the
data values are in line with what we expect. Are there any obvious errors or gaps in the data?


Data exploration must be framed within the scope of a particular problem. Obviously, the first thing to
find out is if it is likely that the dataset will provide useful answers. Is it worth our while to continue,
or do we need to collect more data? Exploratory data analysis is not necessarily carried out with a
particular hypothesis in mind, but perhaps with a sense of which hypotheses are likely to provide
useful information.


Data is evidence that can either support or disprove a hypothesis. This evidence is only meaningful if
it can be compared to a competing hypothesis. In any scientific process, we use a control. To test a
hypothesis, we need to compare it to an equivalent system where the set of variables we are
interested in remain fixed. We should attempt to show causality with a mechanism and explanation.
We need a plausible reason for our observations. We should also consider that the real world is
composed of multiple interacting components, and dealing with multivariate data can lead to
exponentially increasing complexity.


It is with these things in mind, a sketch of the territory we are seeking to explore, that we approach
new datasets. We have an objective, a point we hope to get to, and our data is a map through this
unknown terrain.

Big data


The amount of data that's being created and stored on a global level is almost inconceivable, and it
just keeps growing. Big data is a term that describes the large volume of data, both structured and
unstructured. Let's now delve deeper into big data, beginning with the challenges of big data.

Challenges of big data


Big data is characterized by three challenges. They are as follows:


• The volume of the data
• The velocity of the data
• The variety of the data


Data volume 


The volume problem can be approached from three different directions: efficiency, scalability, and
parallelism. Efficiency is about minimizing the time it takes for an algorithm to process a unit of
information. A component of this is the underlying processing power of the hardware. The other
component, and the one that we have more control over, is ensuring that our algorithms are not
wasting precious processing cycles with unnecessary tasks.


Scalability is really about brute force and throwing as much hardware at a problem as you can.
Taking into account Moore's law, which states that the trend of computer power doubling every two
years will continue until it reaches its limit, it is clear that scalability is not, by itself, going to be
able to keep up with the ever-increasing amounts of data. Simply adding more memory and faster
processors is not, in many cases, going to be a cost-effective solution.


Parallelism is a growing area of machine learning, and it encompasses a number of different
approaches, from harnessing the capabilities of multi-core processors, to large-scale distributed
computing on many different platforms. Probably the most common method is to simply run the same
algorithm on many machines, each with a different set of parameters. Another method is to decompose
a learning algorithm into an adaptive sequence of queries, and have these queries processed in
parallel. A common implementation of this technique is known as MapReduce, or its open source
version, Hadoop.
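The idea behind MapReduce can be sketched in a few lines of plain Python. This is only a single-process illustration of the map and reduce phases, using Python's built-in map and functools.reduce; it is not how Hadoop itself is programmed, and the three short documents are invented for the example.

```python
from collections import Counter
from functools import reduce

documents = ["big data big ideas", "data streams", "big streams of data"]

def map_phase(doc):
    # Each document is processed independently, so this step could run
    # on a separate machine for every chunk of the input.
    return Counter(doc.split())

def reduce_phase(left, right):
    # Partial results are merged pairwise into a single total.
    return left + right

partial_counts = map(map_phase, documents)
totals = reduce(reduce_phase, partial_counts, Counter())
print(totals.most_common())
```

Because the map step has no shared state and the merge is associative, the work can be split across as many workers as there are chunks of data.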


Data velocity 


The velocity problem is often approached in terms of data producers and data consumers. The rate of
data transfer between the two is called the velocity, and it can be measured in interactive response
times. This is the time it takes from a query being made to its response being delivered. Response
times are constrained by latencies, such as hard disk read and write times, and the time it takes to
transmit data across a network.


Data is being produced at ever greater rates, and this is largely driven by the rapid expansion of
mobile networks and devices. The increasing instrumentation of daily life is revolutionizing the way
products and services are delivered. This increasing flow of data has led to the idea of streaming
processing. When input data arrives at a velocity that makes it impossible to store in its entirety, a
level of analysis is necessary as the data streams, in essence, deciding what data is useful and should
be stored, and what data can be thrown away. An extreme example is the Large Hadron Collider at
CERN, where the vast majority of data is discarded. A sophisticated algorithm must scan the data as
it is being generated, looking for the information needle in the data haystack. Another instance where
processing data streams may be important is when an application requires an immediate response.
This is becoming increasingly common in applications such as online gaming and stock market trading.
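A toy version of streaming processing can be written as a Python generator. The sketch below keeps a running mean of everything it sees but passes on only the readings above a threshold; the function name, the readings, and the threshold are all hypothetical.

```python
def running_mean(stream, threshold):
    """Yield (value, mean_so_far) for readings above threshold.

    Only a count and the current mean are kept in memory, so the
    stream never has to be stored in its entirety.
    """
    count, mean = 0, 0.0
    for value in stream:
        count += 1
        mean += (value - mean) / count   # incremental update of the mean
        if value > threshold:
            yield value, mean

readings = iter([1.0, 2.0, 9.0, 3.0, 10.0])
kept = list(running_mean(readings, threshold=5.0))
print(kept)
```

Each reading is examined once and then discarded unless it passes the filter, which is the essence of deciding, as the data streams, what is worth storing.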


It is not just the velocity of incoming data that we are interested in; in many applications, particularly
on the web, the velocity of a system's output is also important. Consider applications such as
recommender systems that need to process a large amount of data and present a response in the time it
takes for a web page to load.


Data variety 


Collecting data from different sources invariably means dealing with misaligned data structures and
incompatible formats. It also often means dealing with different semantics and having to understand a
data system that may have been built on a fairly different set of logical premises. We have to
remember that, very often, data is repurposed for an entirely different application from the one it was
originally intended for. There is a huge variety of data formats and underlying platforms. Significant
time can be spent converting data into one consistent format. Even when this is done, the data itself
needs to be aligned such that each record consists of the same number of features and is measured in
the same units.


Consider the relatively simple task of harvesting data from web pages. The data is already structured
through the use of a markup language, typically HTML or XML, and this can help give us some initial
structure. Yet, we just have to peruse the web to see that there is no standard way of presenting and
tagging content in an information-relevant way. The aim of XML is to include content-relevant
information in markup tags, for instance, by using tags for author or subject. However, the usage of
such tags is far from universal and consistent. Furthermore, the web is a dynamic environment and
many web sites go through frequent structural changes. These changes will often break web
applications that expect a specific page structure.


The following diagram shows two dimensions of the big data challenge. I have included a few
examples where these domains might approximately sit in this space. Astronomy, for example, has
very few sources. It has a relatively small number of telescopes and observatories. Yet the volume of
data that astronomers deal with is huge. Compare this with something like environmental sciences,
where the data comes from a variety of sources, such as remote sensors, field surveys, validated
secondary materials, and so on.


Integrating different data sets can take a significant amount of development time; up to 90 percent in
some cases. Each project's data requirements will be different, and an important part of the design
process is positioning our data sets with regard to these three elements.
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Data models 


A fundamental question for the data scientist is how the data is stored. We can talk about the 
hardware, and in this respect, we mean nonvolatile memory such as the hard drive of a computer or a 
flash disk. Another, more logical, way of interpreting the question is to ask how the data is 
organized. In a personal computer, the most visible way that data is stored is hierarchically, in nested 
folders and files. Data can also be stored in a table format or in a spreadsheet. When we are thinking 
about structure, we are interested in categories and category types, and how they are related. In a 
table, how many columns do we need, and in a relational database, how are tables linked? A data 
model should not try to impose a structure on the data, but rather find a structure that most naturally 
emerges from the data. 


Data models consist of three components: 


• Structure: A table is organized into columns and rows; tree structures have nodes and edges, 
and dictionaries have the structure of key-value pairs. 

• Constraints: This defines the type of valid structures. For a table, this would include the fact 
that all rows have the same number of columns, and each column contains the same data type for 
every row. For example, a column, items sold, would only contain integer values. For 
hierarchical structures, a constraint would be a folder that can only have one immediate parent. 

• Operations: This includes actions such as finding a particular value, given a key, or finding all 
rows where the items sold are greater than 100. This is sometimes considered separate from the 
data model because it is often a higher-level software layer. However, all three of these 
components are tightly coupled, so it makes sense to think of the operations as part of the data 
model. 
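

As a minimal sketch of how the three components fit together, we can express a structure, a 
constraint, and an operation in plain Python. The column names, the check_row helper, and the sample 
rows are all invented for illustration; a real database would enforce these rules for us. 

```python
# Structure: a table is a list of rows, each row a dict keyed by column name.
COLUMNS = {"product": str, "items_sold": int}  # hypothetical schema


def check_row(row):
    """Constraint: every row has exactly the declared columns and types."""
    if set(row) != set(COLUMNS):
        raise ValueError(f"unexpected columns: {sorted(row)}")
    for name, expected in COLUMNS.items():
        if not isinstance(row[name], expected):
            raise TypeError(f"{name} must be of type {expected.__name__}")
    return row


def insert(table, row):
    """Only rows that satisfy the constraint are added to the table."""
    table.append(check_row(row))


def rows_where_sold_over(table, threshold):
    """Operation: find all rows where items_sold exceeds a threshold."""
    return [r for r in table if r["items_sold"] > threshold]


table = []
insert(table, {"product": "widget", "items_sold": 120})
insert(table, {"product": "gadget", "items_sold": 80})
print(rows_where_sold_over(table, 100))
```

Trying to insert a row such as {"product": "gizmo", "items_sold": "many"} raises a TypeError, which 
is the constraint doing its job. 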


To encapsulate raw data with a data model, we create databases. Databases solve some key 
problems: 


• They allow us to share data: They give multiple users access to the same data with varying read 
and write privileges. 

• They enforce a data model: This includes not only the constraints imposed by the structure, say 
parent-child relationships in a hierarchy, but also higher-level constraints such as only allowing 
one user named bob, or requiring a value to be a number between one and eight. 

• They allow us to scale: Once the data is larger than the allocated size of our volatile memory, 
mechanisms are needed to both facilitate the transfer of data and also allow the efficient 
traversal of a large number of rows and columns. 

• They allow flexibility: They essentially try to hide complexity and provide a standard way 
of interacting with data. 
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Data distributions 


A key characteristic of data is its probability distribution. The most familiar distribution is the normal 
or Gaussian distribution. This distribution is found in many physical systems, and it arises whenever 
a quantity is the sum of many small, independent random effects. The normal distribution can be 
defined in terms of a probability density function: 


$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$


Here, σ (sigma) is the standard deviation and μ (mu) is the mean. This equation simply describes 
the relative likelihood that a random variable, x, will take on a given value. We can interpret the 
standard deviation as the width of a bell curve, and the mean as its center. Sometimes, the term 
variance is used, and this is simply the square of the standard deviation. The standard deviation 
essentially measures how spread out the values are. As a general rule of thumb, in a normal 
distribution, 68% of the values are within 1 standard deviation of the mean, 95% of values are within 
2 standard deviations of the mean, and 99.7% are within 3 standard deviations of the mean. 
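

The rule of thumb can be checked numerically. The following sketch relies on a standard identity that 
is not derived in this chapter: for a normal distribution, the fraction of values within k standard 
deviations of the mean is erf(k/√2), where erf is the error function from Python's math module. 

```python
from math import erf, sqrt

# Fraction of a normal population within k standard deviations of the mean.
within = {k: erf(k / sqrt(2)) for k in (1, 2, 3)}

for k, fraction in within.items():
    print(f"within {k} standard deviation(s): {fraction:.4f}")
```

The three fractions come out at roughly 0.6827, 0.9545, and 0.9973, which is where the rounded 
68%, 95%, and 99.7% figures come from. 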


We can get a feel for what these terms do by running the following code and calling the normal() 
function with different values for the mean and variance. In this example, we create the plot of a 
normal distribution, with a mean of 1 and a variance of 0.5: 


import numpy as np 
import matplotlib.pyplot as plt 
# mlab.normpdf has been removed from recent matplotlib releases, so we use scipy.stats instead 
from scipy.stats import norm 


def normal(mean=0, var=1): 
    sigma = np.sqrt(var)  # the standard deviation is the square root of the variance 
    x = np.linspace(-3, 3, 100) 
    plt.plot(x, norm.pdf(x, mean, sigma)) 
    plt.show() 


normal(1, 0.5) 
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Related to the Gaussian distribution is the binomial distribution. We actually obtain a normal 
distribution by repeating a binomial process, such as tossing a coin, many times. Over time, the 
proportion of tosses that result in heads approaches one half. 


$$P(x) = \frac{n!}{x!\,(n-x)!}\, p^{x} q^{n-x}$$


In this formula, n is the number of coin tosses, x is the number of heads, p is the probability that a 
single toss results in heads, and q is the probability (1-p) that it results in tails. In a typical 
experiment, say to determine the probability of various outcomes of a series of n coin tosses, we can 
perform the series many times, and obviously the more times we perform the experiment, the better 
our understanding of the statistical behavior of the system: 


import matplotlib.pyplot as plt 
from scipy.stats import binom 


def binomial(x=10, n=10, p=0.5): 
    fig, ax = plt.subplots(1, 1) 
    x = range(x)  # evaluate the distribution at 0, 1, ..., x-1 heads 
    rv = binom(n, p) 
    plt.vlines(x, 0, rv.pmf(x), colors='k', linestyles='-') 
    plt.show() 


binomial() 


You will observe a plot of the probability mass function, drawn as one vertical line for each number 
of heads. 
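

The binomial formula itself is short enough to evaluate directly. This is a minimal sketch using only 
the standard library (math.comb needs Python 3.8 or later); the function name binomial_pmf is our 
own, chosen for illustration. 

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability of exactly x heads in n independent tosses."""
    q = 1 - p
    return comb(n, x) * p**x * q**(n - x)


n, p = 10, 0.5
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

print(pmf[5])    # probability of exactly 5 heads in 10 fair tosses
print(sum(pmf))  # the probabilities of all possible outcomes add up to 1
```

Even for a fair coin, exactly five heads in ten tosses happens only about a quarter of the time; the 
rest of the probability is spread over the neighboring outcomes. 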
Another aspect of discrete distributions is understanding the likelihood of a given number of events 
occurring within a particular interval of space and/or time. If we know that events occur at a constant 
average rate, and each event occurs independently, we can describe the number of events with a 
Poisson distribution. We can best understand this distribution using a probability mass function. This 
measures the probability that a given number of events will occur in a given interval of space or 
time. 


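The Poisson probability mass function involves two quantities: lambda, λ, a real number greater than 
0 that gives the average number of events, and k, the number of events, an integer that is 0, 1, 2, and 
so on. 


$$f(k; \lambda) = \Pr(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$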
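

To see the formula at work, here is a small sketch that evaluates it with the standard library alone. 
The average rate of 3 events per interval is an arbitrary value chosen for illustration, and 
poisson_pmf is our own helper name. 

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k events when the average is lam events."""
    return lam**k * exp(-lam) / factorial(k)


lam = 3.0  # hypothetical average number of events per interval
probs = [poisson_pmf(k, lam) for k in range(50)]

print(probs[0])    # probability that no events occur at all
print(sum(probs))  # very close to 1: larger counts are vanishingly unlikely
```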


Here, we generate the plot of a Poisson distribution using the scipy.stats module: 


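import numpy as np 
import matplotlib.pyplot as plt 
from scipy.stats import poisson 


def pois(x=1000): 
    xr = np.arange(x)  # candidate values for the average rate, lambda 
    ps = poisson(xr)  # one frozen distribution per rate 
    plt.plot(xr, ps.pmf(x // 2))  # probability of exactly x/2 events at each rate 
    plt.show() 


pois() 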


The output of the preceding commands is as shown in the following diagram: 







We can describe continuous data distributions using probability density functions. A density function 
describes the relative likelihood that a continuous random variable will take on a specified value. For 
univariate distributions, that is, those where there is only one random variable, the probability of 
finding a point X in an interval (a, b) is given by the following: 


$$P(a \le X \le b) = \int_a^b f_X(x)\,dx$$


This describes the fraction of a sampled population for which a value, x, lies between a and b. 
Density functions really only have meaning when they are integrated, and this will tell us how densely 
a population is distributed around certain values. Intuitively, we understand this as the area under the 
graph of its probability function between these two points. The cumulative distribution function 
(CDF) is defined as the integral of its probability density function, f_X: 


$$F_X(x) = \int_{-\infty}^{x} f_X(u)\,du$$


The CDF describes the proportion of a sampled population having values for a particular variable 
that are less than or equal to x. The following code shows a discrete (binomial) cumulative 
distribution function. The s1 and s2 shape parameters are the number of trials and the probability of 
success in each trial: 


import numpy as np 
import matplotlib.pyplot as plt 
import scipy.stats as stats 


def cdf(s1=50, s2=0.2): 
    x = np.linspace(0, s2 * 100, s1 * 2) 
    cd = stats.binom.cdf 
    plt.plot(x, cd(x, s1, s2)) 
    plt.show() 
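

The idea that probability is the area under the density curve can be illustrated for the continuous 
case as well. The sketch below approximates the integral of a standard normal density between -1 
and 1 with the trapezoidal rule and compares it with the difference of two CDF values. The helper 
names are our own, and the closed form of the normal CDF in terms of the error function is a 
standard result that we assume rather than derive. 

```python
from math import erf, exp, pi, sqrt


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal probability density function."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Normal CDF, written in terms of the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))


def integrate(f, a, b, steps=10_000):
    """Trapezoidal approximation of the integral of f over (a, b)."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h


area = integrate(normal_pdf, -1.0, 1.0)
print(area)                                 # area under the density curve
print(normal_cdf(1.0) - normal_cdf(-1.0))   # the same probability from the CDF
```

Both numbers agree to several decimal places, which is the 68% from the rule of thumb for one 
standard deviation. 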
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Data from databases 


We generally interact with databases via a query language. One of the most popular query languages 
is SQL. Python has a database API specification, PEP 0249, which creates a consistent way to work 
with numerous database types. This makes the code we write more portable across databases and 
allows a richer span of database connectivity. To illustrate how simple this is, we are going to use the 
mysql.connector module as an example. MySQL is one of the most popular database systems, with a 
straightforward, human-readable query language. To practice using this module, you will need to 
have a MySQL server installed on your machine. This is available from 


https://dev.mysql.com/downloads/mysql/. 


This should also come with a test database called world, which includes statistical data on world 
cities. 


Ensure that the MySQL server is running, and run the following code: 


import mysql.connector 
from mysql.connector import errorcode 


# replace the user name and password with those of your own MySQL installation 
cnx = mysql.connector.connect(user='root', password='password', 
                              database='world', buffered=True) 
cursor = cnx.cursor(buffered=True) 
query = ("select * from city where population > 1000000 order by population") 
cursor.execute(query) 

worldList = [] 
for city in cursor: 
    # keep the second and fifth columns of each row (name and population) 
    worldList.append([city[1], city[4]]) 

cursor.close() 
cnx.close() 
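

Because PEP 0249 defines the connect, cursor, and execute pattern, the same steps carry over to 
other drivers. As a sketch that runs without a database server, here is the equivalent query against 
an in-memory SQLite database using the standard library's sqlite3 module. The table layout and the 
three cities are made up for illustration and are not the contents of the world database. 

```python
import sqlite3

# An in-memory database with a hypothetical, much simplified city table.
cnx = sqlite3.connect(":memory:")
cursor = cnx.cursor()
cursor.execute("create table city (name text, population integer)")
cursor.executemany(
    "insert into city values (?, ?)",
    [("Smallville", 45_000), ("Metropolis", 1_600_000), ("Gotham", 8_400_000)],
)

# Parameterized query: the driver substitutes the value for the ? placeholder.
cursor.execute(
    "select name, population from city where population > ? order by population",
    (1_000_000,),
)
big_cities = [list(row) for row in cursor]

cursor.close()
cnx.close()
print(big_cities)
```

One detail that does vary between drivers is the placeholder style: sqlite3 uses ?, whereas 
mysql.connector uses %s. Passing values as parameters rather than building the query string by hand 
is the safer habit in either case. 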
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Data from the Web 


Information on the web is structured into HTML or XML documents. Markup tags give us clear hooks 
for sampling our data. Numeric data will often appear in a table, and this makes it relatively easy 
to use because it is already structured in a meaningful way. Let's look at a typical excerpt from an 
HTML document: 


<table border="0" cellpadding="5" cellspacing="2" class="details" width="95%"> 
<tbody> 
<tr> 
<th>Species</th> 
<th>Data1</th> 
<th>data2</th> 
</tr> 
<tr> 
<td>whitefly</td> 
<td>24</td> 
<td>16</td> 
</tr> 
</tbody> 
</table> 


This shows the first two rows of a table, with a heading and one row of data containing two values. 
Python has an excellent library, Beautiful Soup, for extracting data from HTML and XML 
documents. Here, we read some test data into an array, and put it into a format that would be suitable 
for input in a machine learning algorithm, say a linear classifier: 


import urllib.request 
from bs4 import BeautifulSoup 
import numpy as np 


url = urllib.request.urlopen("http://interthing.org/dmls/species.html") 
html = url.read() 
soup = BeautifulSoup(html, "lxml") 
table = soup.find("table") 

# the first row of the table holds the column headings 
headings = [th.get_text() for th in table.find("tr").find_all("th")] 

# pair each cell in the remaining rows with its heading 
datasets = [] 
for row in table.find_all("tr")[1:]: 
    dataset = list(zip(headings, (td.get_text() for td in row.find_all("td")))) 
    datasets.append(dataset) 

nd = np.array(datasets) 
features = nd[:, 1:, 1].astype('float') 
targets = (nd[:, 0, 1:]).astype('str') 
print(features) 
print(targets) 


As we can see, this is relatively straightforward. What we need to be aware of is that we are relying 
on our source web page to remain unchanged, at least in terms of its overall structure. One of the 
major difficulties with harvesting data off the web in this way is that if the owners of the site decide 
to change the layout of their page, it will likely break our code. 
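

One way to soften this problem is to check our assumptions about the page before using the data, so 
that a layout change produces a clear error rather than silently wrong numbers. The following is only 
a sketch: it uses the standard library's html.parser instead of Beautiful Soup so that it runs without 
extra packages, and the TableRows class, the expected headings, and the sample page are all 
hypothetical. 

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th> and <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td") and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


EXPECTED_HEADINGS = ["Species", "Data1", "Data2"]  # what our code assumes


def extract(html):
    """Return the data rows, or fail loudly if the headings have changed."""
    parser = TableRows()
    parser.feed(html)
    if not parser.rows or parser.rows[0] != EXPECTED_HEADINGS:
        raise ValueError("page layout changed: unexpected table headings")
    return parser.rows[1:]


page = """<table><tbody>
<tr><th>Species</th><th>Data1</th><th>Data2</th></tr>
<tr><td>whitefly</td><td>24</td><td>76</td></tr>
</tbody></table>"""

print(extract(page))
```

If the site renames or reorders a column, extract() raises a ValueError instead of handing mislabeled 
values to the learning algorithm. 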


Another data format you are likely to come across is the JSON format. Originally used for serializing 
JavaScript objects, JSON is not, however, dependent on JavaScript. It is merely an encoding format. 
JSON is useful because it can represent hierarchical and multivariate data structures. It is basically a 
collection of key-value pairs: 


{"Languages":[{"Language":"Python","Version":"0"}, 
{"Language":"PHP","Version":"5"}], 
"OS":{"Microsoft":"Windows 10","Linux":"Ubuntu 14"}, 
"Name":"John\"the fictional\" Doe", 
"Location":{"Street":"Some Street","Suburb":"Some Suburb"}, 
"Languages":[{"Language":"Python","Version":"0"}, 
{"Language":"PHP","Version":"5"}] 
} 


If we save the preceding JSON to a file called jsondata.json, we can load and print it as follows: 


import json 
from pprint import pprint 


with open('jsondata.json') as file: 
    data = json.load(file) 

pprint(data) 
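

Once loaded, JSON data is just nested Python dictionaries and lists, so we reach individual values 
with ordinary indexing. Here is a small self-contained sketch that parses a shortened version of the 
record from a string rather than a file: 

```python
import json

# A raw string keeps the backslashes that escape the inner quotes in the JSON.
raw = r'''
{"Name": "John \"the fictional\" Doe",
 "OS": {"Microsoft": "Windows 10", "Linux": "Ubuntu 14"},
 "Languages": [{"Language": "Python", "Version": "0"},
               {"Language": "PHP", "Version": "5"}]}
'''

data = json.loads(raw)

print(data["Name"])         # a plain string value
print(data["OS"]["Linux"])  # a value nested inside another object

# Flatten the list of language records into a simple lookup table.
versions = {entry["Language"]: entry["Version"] for entry in data["Languages"]}
print(versions)
```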
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Data from natural language 


Natural language processing is one of the more difficult things to do in machine learning because it 
focuses on what machines, at the moment, are not very good at: understanding the structure in complex 
phenomena. 


As a starting point, we can make a few statements about the problem space we are considering. The 
number of words in any language is usually very large compared to the subset of words that are used 
in a particular conversation. Our data is sparse compared to the space it exists in. Moreover, words 
tend to appear in predefined sequences. Certain words are more likely to appear together. Sentences 
have a certain structure. Different social settings, such as at work, home, or out socializing, or in 
formal settings such as communicating with regulatory authorities, government, and bureaucratic 
settings, all require the use of overlapping subsets of a vocabulary. Apart from cues such as body 
language, intonation, eye contact, and so forth, the social setting is probably the most important 
factor when trying to extract meaning from natural language. 


To work with natural language in Python, we can use the Natural Language Tool Kit (NLTK). If 
it is not installed, you can execute the pip install -U nltk command. 


The NLTK also comes with a large library of lexical resources. You will need to download these 
separately, and NLTK has a download manager accessible through the following code: 


import nltk 
nltk.download () 


A window should open where you can browse through the various files. This includes a range of 
books and other written material, as well as various lexical models. To get started, you can just 
download the package, Book. 


A text corpus is a large body of text consisting of numerous individual text files. NLTK comes with 
corpora from a variety of sources such as classical literature (the Gutenberg Corpus), the web and 
chat text, Reuters news, and a corpus containing text categorized by genres such as news, editorial, 
religion, fiction, and so on. You can also load any collection of text files using the following code: 


from nltk.corpus import PlaintextCorpusReader 
corpusRoot= 'path/to/corpus' 
yourCorpus=PlaintextCorpusReader(corpusRoot, '.*') 


The second argument to the PlaintextCorpusReader constructor is a regular expression indicating the 
files to include. Here, it simply indicates that all the files in that directory are included. This second 
parameter could also be a list of file locations, such as ['file1', 'dir2/file2']. 


Let's take a look at one of the existing corpora. As an example, we are going to load the Brown 
Corpus: 


from nltk.corpus import brown 
cat = brown.categories() 
print(cat) 
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 
'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 
'science_fiction'] 


The Brown corpus is useful because it enables us to study the systemic differences between genres. 
Here is an example: 


import nltk 
from nltk.corpus import brown 

posmod = ['love', 'happy', 'good', 'clean'] 
negmod = ['hate', 'sad', 'bad', 'dirty'] 

cats = brown.categories() 
for cat in cats: 
    text = brown.words(categories=cat) 
    fdist = nltk.FreqDist(w.lower() for w in text) 
    # count the occurrences of the positive and negative words in this genre 
    pcount = [fdist[m] for m in posmod] 
    ncount = [fdist[m] for m in negmod] 
    print(cat + ' positive: ' + str(sum(pcount))) 
    print(cat + ' negative: ' + str(sum(ncount))) 
    rat = sum(pcount) / sum(ncount) 
    print('ratio= %s' % rat) 
    print() 


Here, we have extracted a crude measure of sentiment from different genres by comparing the 
occurrences of four positive sentiment words with their antonyms. 
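

The same counting idea can be tried without downloading a corpus. In this sketch, 
collections.Counter stands in for nltk.FreqDist and a made-up sentence stands in for a genre; the 
regular expression used to split words is a simplification and not how NLTK tokenizes text. 

```python
import re
from collections import Counter

POSITIVE = ["love", "happy", "good", "clean"]
NEGATIVE = ["hate", "sad", "bad", "dirty"]


def sentiment_counts(text):
    """Count how often the positive and negative words occur in text."""
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(words)
    pos = sum(freq[w] for w in POSITIVE)
    neg = sum(freq[w] for w in NEGATIVE)
    return pos, neg


sample = "A good day, a happy day. Not a bad day, but the dishes are dirty. Good!"
pos, neg = sentiment_counts(sample)

# Guard against division by zero when no negative words are present.
ratio = pos / neg if neg else float("inf")
print(pos, neg, ratio)
```

Note that "Not a bad day" still counts as a negative word, which shows how crude a simple word count 
is as a measure of sentiment. 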
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Data from images 


Images are a rich and easily available source of data, and they are useful for learning applications 
such as object recognition, grouping, grading objects, as well as image enhancement. Images, of 
course, can be put together as a time series. Animating images is useful for both presentation and 
analysis; for example, we can use video to study trajectories, monitor environments, and learn 
dynamic behavior. 


Image data is structured as a grid or matrix with color values assigned to each pixel. We can get a feel 
for how this works by using the Python Imaging Library. For this example, you will need to execute 
the following lines: 


from PIL import Image 
from matplotlib import pyplot as plt 
import numpy as np 

image = np.array(Image.open('data/sampleImage.jpg')) 
plt.imshow(image, interpolation='nearest') 
plt.show() 
print(image.shape) 


Out[10]: (536, 800, 3) 


We can see that this particular image is 536 pixels high and 800 pixels wide, since NumPy lists the 
rows first. There are 3 values per pixel, representing color values between 0 and 255, for red, green, 
and blue respectively. Note that the coordinate system's origin (0,0) is the top left corner. Once we 
have our images as NumPy arrays, we can start working with them in interesting ways, for example, 
taking slices: 


im2=image[0:100,0:100,2] 
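

To make the indexing concrete without needing an image file, the following sketch builds a tiny 
synthetic image and slices it in the same way. The array and its colors are invented for illustration. 

```python
import numpy as np

# Synthetic stand-in for a photograph: 4 rows (height) x 6 columns (width) x RGB.
image = np.zeros((4, 6, 3), dtype=np.uint8)
image[:, :, 0] = 255       # fill the red channel everywhere
image[0, 0] = (0, 0, 255)  # make the top left pixel pure blue

height, width, channels = image.shape
print(height, width, channels)

# Rows 0-1, columns 0-2, channel 2: the blue values of the top left block.
blue = image[0:2, 0:3, 2]
print(blue)
```

The slice has shape (2, 3): the first index selects rows counted from the top, the second selects 
columns counted from the left, and the third picks the color channel. 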
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Data from application programming interfaces 


Many social networking platforms have application programming interfaces (APIs) that give the 
programmer access to various features. These interfaces can generate quite large amounts of 
streaming data. Many of these APIs have variable support for Python 3 and for some operating 
systems, so be prepared to do some research regarding the compatibility of systems. 


Gaining access to a platform's API usually involves registering an application with the vendor and 
then using supplied security credentials, such as public and private keys, to authenticate your 
application. 


Let's take a look at the Twitter API, which is relatively easy to access and has a well-developed 
library for Python. To get started, we need to load the Twitter library. If you do not have it already, 
simply execute the pip install twitter command from your Python command prompt. 


You will need a Twitter account. Sign in and go to apps.twitter.com. Click on the Create New App 
button and fill out the details on the Create An Application page. Once you have submitted this, you 
can access your credential information by clicking on your app from the application management page 
and then clicking on the Keys and Access Tokens tab. 


The four items we are interested in here are the API Key, the API Secret, the Access Token, and the 
Access Token Secret. Now, to create our Twitter object: 


from twitter import Twitter, OAuth 

# create our twitter object 
t = Twitter(auth=OAuth(accessToken, secretToken, apiKey, apiSecret)) 

# get our home time line 
home = t.statuses.home_timeline() 

# get a public timeline 
anyone = t.statuses.user_timeline(screen_name="abc730") 

# search for a hash tag 
pycon = t.search.tweets(q="#pycon") 

# the screen name of the user who wrote the first 'tweet' 
user = anyone[0]['user']['screen_name'] 

# time tweet was created 
created = anyone[0]['created_at'] 

# the text of the tweet 
text = anyone[0]['text'] 


You will, of course, need to fill in the authorization credentials that you obtained from Twitter earlier. 


Remember that in a publicly accessible application you should never store these credentials in 
human-readable form, and certainly not in the source file itself. Preferably, keep them encrypted and 
outside any public directory. 
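

One common way to keep credentials out of the source file is to read them from environment 
variables. This is a sketch of that approach rather than a complete security solution; the variable 
names and the load_credentials helper are hypothetical, and the demonstration uses a stand-in 
dictionary instead of the real environment. 

```python
import os

# Hypothetical variable names; use whatever naming your deployment prefers.
REQUIRED = (
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
    "TWITTER_API_KEY",
    "TWITTER_API_SECRET",
)


def load_credentials(env=os.environ):
    """Return the four credentials in order, or fail if any are missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing credentials: " + ", ".join(missing))
    return tuple(env[name] for name in REQUIRED)


# Demonstration with a stand-in mapping rather than the real environment.
fake_env = {name: "dummy-" + name.lower() for name in REQUIRED}
creds = load_credentials(fake_env)
print(len(creds))
```

In a real application, the returned tuple could be unpacked straight into the OAuth call, and the 
values would be set in the environment of the process rather than written anywhere in the code. 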
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Signals 


One form of data that is often encountered in primary scientific research is the binary stream. 
There are specific codecs for video and audio transmission and storage, and often, we are looking for 
higher-level tools to deal with each specific format. There are various signal sources we might be 
considering, such as a radio telescope, the sensor on a camera, or the electrical impulses from a 
microphone. Signals all share the same underlying principles based on wave mechanics and harmonic 
motion. 


Signals are generally studied using time-frequency analysis. The central concept here is that a 
continuous signal in time and space can be decomposed into frequency components. We use what is 
known as a Fourier transform to move between the time and frequency domains. This utilizes the 
interesting fact that any reasonably well-behaved function, including nonperiodic functions, can be 
represented by a series of sine and cosine functions. For a function on the interval from negative pi 
to pi, this is illustrated by the following: 


$$f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} \left( a_n \cos nx + b_n \sin nx \right)$$


To make this useful, we need to find the values for a_n and b_n. We do this by multiplying both 
sides of the equation by cos mx, where m is an integer, and integrating from negative pi to pi: 


$$\int_{-\pi}^{\pi} f(x)\cos mx\,dx = \frac{a_0}{2}\int_{-\pi}^{\pi}\cos mx\,dx + \sum_{n=1}^{\infty}\left( a_n\int_{-\pi}^{\pi}\cos nx\cos mx\,dx + b_n\int_{-\pi}^{\pi}\sin nx\cos mx\,dx \right)$$


The sine and cosine functions here are called orthogonal functions, in a similar notion to how we 
consider x, y, and z to be orthogonal in a vector space. Now, if you can remember all your 
trigonometric identities, you will know that the integral of sine times cosine with integer coefficients 
is always zero between negative pi and pi, so the last term vanishes. The first term is also zero for any 
m greater than zero. If we do the calculation, it turns out that the middle term on the right-hand side 
is zero as well, except when n equals m. In this case, the integral equals pi. Knowing this, we can 
write the following: 


$$a_n = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\cos nx\,dx$$


So, in the first step, if we multiply by sin mx instead of cos mx, then we can derive the value of b_n: 


$$b_n = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\sin nx\,dx$$


We can see that we have decomposed a signal into a series of sine values and cosine values. This 
enables us to separate the frequency components of a signal. 
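

The two coefficient formulas can be tried out numerically. The sketch below approximates the 
integrals with a simple midpoint rule and applies them to f(x) = x, a test function chosen here only 
because its coefficients are known: every a_n is zero and b_n alternates in sign as 2, -1, 2/3, and so 
on. The helper names are our own. 

```python
from math import cos, pi, sin


def integrate(g, a, b, steps=20_000):
    """Midpoint-rule approximation of the integral of g over (a, b)."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))


def fourier_coefficients(f, n):
    """Approximate a_n and b_n for f on the interval (-pi, pi)."""
    a_n = integrate(lambda x: f(x) * cos(n * x), -pi, pi) / pi
    b_n = integrate(lambda x: f(x) * sin(n * x), -pi, pi) / pi
    return a_n, b_n


def f(x):
    return x  # an odd function, so the cosine coefficients should vanish


for n in range(1, 4):
    a_n, b_n = fourier_coefficients(f, n)
    print(n, a_n, b_n)
```

The printed cosine coefficients are zero to within rounding error, while the sine coefficients come out 
close to 2, -1, and 2/3, so this particular signal is built entirely from sine components. 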
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Data from sound 


One of the most common and easy to study signals is audio. We are going to use the soundfile 
module. You can install it via pip if you do not have it. The soundfile module has a read function 
that returns the .wav file data as a NumPy array, along with the sample rate. To try the following 
code, you will need a short 16-bit wave file called audioSamp.wav. This can be downloaded from 
davejulian.net/mlbook. Save it in your data directory, in your working directory: 


import soundfile as sf 
import matplotlib.pyplot as plt 
import numpy as np 


sig, samplerate = sf.read('data/audioSamp.wav') 
sig.shape 


We see that the sound file is represented by a number of samples, each with two values, one for each 
stereo channel. This array is effectively the function, as a vector, that describes the .wav file. We can, 
of course, create slices of our sound file: 


slice=sig[0:500,:] 


Here, we slice the first 500 samples. Let's calculate the Fourier transform of the slice and plot it: 


# transform along the time axis (axis 0), giving one spectrum per channel 
ft = np.abs(np.fft.fft(slice, axis=0)) 
# finally, let's plot the result 
plt.plot(ft) 
plt.plot(slice) 
plt.show() 

The output of the preceding commands is as follows: 


[Figure: the magnitude of the Fourier transform plotted together with the audio slice] 
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Cleaning data 


To gain an understanding of which cleaning operations may be required for a particular dataset, we 
need to consider how the data was collected. One of the major cleaning operations involves dealing 
with missing data. We have already encountered an example of this in the last chapter, when we 
examined the temperature data. In this instance, the data had a quality parameter, so we could simply 
exclude the incomplete data. However, this may not be the best solution for many applications. It may 
be necessary to fill in the missing data. How do we decide what data to use? In the case of our 
temperature data, we could fill the missing values in with the average values for that time of year. 
Notice that we presuppose some domain knowledge, for example, the data is more or less periodic; it 
is in line with the seasonal cycle. So, it is a fair assumption that we could take the average for that 
particular date for every year we have a reliable record. However, consider that we are attempting to 
find a signal representing an increase in temperature due to climate change. In that case, taking the 
average for all years would distort the data and potentially hide a signal that could indicate warming. 
Once again, this requires extra knowledge and is specific to what we actually want to learn from 
the data. 
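

As a sketch of the averaging approach described above, the following code fills each missing reading 
with the mean of the readings for the same day in other years. The records are invented for 
illustration, and the function assumes that every day has at least one reliable reading. 

```python
from statistics import mean

# Hypothetical records: (year, day_of_year, max_temperature), None if missing.
records = [
    (2001, 1, 30.0), (2002, 1, 32.0), (2003, 1, None),
    (2001, 2, 28.0), (2002, 2, None), (2003, 2, 29.0),
]


def fill_with_day_average(rows):
    """Replace missing temperatures with the average for that day of the year."""
    by_day = {}
    for _, day, temp in rows:
        if temp is not None:
            by_day.setdefault(day, []).append(temp)
    averages = {day: mean(temps) for day, temps in by_day.items()}
    return [(y, d, t if t is not None else averages[d]) for y, d, t in rows]


filled = fill_with_day_average(records)
print(filled[2], filled[4])
```

Because the filled values are averages over all years, they flatten any year-to-year trend, which is 
exactly the distortion to watch for when looking for a warming signal. 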


Another consideration is that missing data may be one of three types, which are as follows: 


• empty 

• zero 

• null 


Different programming environments may treat these slightly differently. Out of the three, only zero is 
a measurable quantity. We know that zero can be placed on a number line before 1, 2, 3, and so on, 
and we can compare other numbers to zero. So, normally zero is encoded as numeric data. Empties 
are not necessarily numeric, and despite being empty, they may convey information. For example, if 
there is a field for middle name in a form, and the person filling out the form does not have a middle 
name, then an empty field accurately represents a particular situation, that is, having no middle name. 
Once again, this depends on the domain. In our temperature data, an empty field indicates missing 
data as it does not make sense for a particular day to have no maximum temperature. Null values, on 
the other hand, mean something slightly different in computing from their everyday usage. For the 
computer scientist, null is not the same thing as no value or zero. Null values cannot be compared to 
anything else; they indicate that the value of a field is unknown. Nulls are different from empty 
values. In our middle name example, a null value would indicate that it is unknown if the person has 
a middle name or not. 
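

In Python, a natural (though not the only) way to represent these three cases is with 0, an empty 
string, and None. The sketch below, with a made-up describe helper, shows how to tell them apart 
and why a simple truth test is not enough. 

```python
def describe(value):
    """Classify a field as null, empty, zero, or an ordinary value."""
    if value is None:
        return "null"   # unknown: nothing was recorded
    if value == "":
        return "empty"  # recorded, and known to be blank
    if value == 0:
        return "zero"   # a genuine measurement
    return "value"


fields = [0, "", None, "Jane", 21.5]
kinds = [describe(v) for v in fields]
print(kinds)

# Pitfall: all three "missing-like" cases are falsy, so `not value` cannot
# distinguish a measured zero from an unknown value.
print([not v for v in (0, "", None)])
```

A cleaning step that tests fields with `if not value` would therefore throw away legitimate zero 
readings along with the genuinely missing ones. 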


Another common data cleaning task is converting the data to a particular format. For our purposes 
here, the end data format we are interested in is a Python data structure such as a NumPy array. We 
have already looked at converting data from the JSON and HTML formats, and this is fairly 
straightforward. 


Another format that we are likely to come across is Adobe's Portable Document Format (PDF). 
Importing data from PDF files can be quite difficult because PDF files are built on page layout 
primitives, and unlike HTML or JSON, they do not have meaningful markup tags. There are several 
non-Python tools for turning PDFs into text, such as pdftotext. This is a command line tool that is 
included in many Linux distributions and is also available for Windows. Once we have converted the 
PDF file into text, we still need to extract the data, and the way the data is embedded in the document 
determines how we can extract it. If the data is separated from the rest of the document, say in a 
table, then we can use Python's text parsing tools to extract it. Alternatively, we can use a Python 
library for working with PDF documents such as pdfminer3k. 


Another common cleaning task is converting between data types. There is always the risk of losing 
data when converting between types. This happens when the target type stores less data than the 
source, for instance, converting to float16 from float32. Sometimes, we need to convert data at the 
file level. This occurs when a file has an implicit typing structure, for example, a spreadsheet. This is 
usually done within the application that created the file. For example, an Excel spreadsheet can be 
saved as a comma separated text file and then imported into a Python application. 
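

The loss of precision when converting to a smaller type is easy to demonstrate. This sketch uses the 
standard library's struct module to store a number as a 32-bit and then a 16-bit float and read it back; 
the roundtrip helper and the sample value are our own choices for illustration. 

```python
import struct


def roundtrip(value, fmt):
    """Store value in the given binary format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


x = 3.14159265
as_float32 = roundtrip(x, "<f")  # 32-bit float
as_float16 = roundtrip(x, "<e")  # 16-bit float

print(as_float32, abs(as_float32 - x))
print(as_float16, abs(as_float16 - x))
```

The 32-bit version agrees with the original to about seven significant digits, whereas the 16-bit 
version keeps only about three, so the conversion silently discards information. 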
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Visualizing data 


There are a number of reasons for why we visually represent the data. At the data exploration stage, 
we can gain an immediate understanding of data properties. Visual representation serves to highlight 
patterns in data and suggest modeling strategies. Exploratory graphs are usually made quickly and in 
large numbers. We are not so much concerned with aesthetic or stylistic issues, but we simply want to 
see what the data looks like. 


Beyond using graphs to explore data, they are a primary means of communicating information about 
our data. Visual representation helps clarify data properties and stimulate viewer engagement. The 
human visual system is the highest bandwidth channel to the brain, and visualization is the most 
efficient way to present a large amount of information. By creating a visualization, we can 
immediately get a sense of important parameters, such as the maximum, minimum, and trends that may 
be present in the data. Of course, this information can be extracted from data through statistical 
analysis; however, analysis may not reveal specific patterns in the data that visualization will. The 
human visual pattern recognition system is, at the moment, significantly superior to that of a machine. 
Unless we have clues as to what we are looking for, algorithms may not pick out important patterns 
that a human visual system will. 


The central problem for data visualization is mapping data elements to visual attributes. We do this by 
first classifying the data types as nominal, ordinal, or quantitative, and then determining which visual 
attributes represent each data type most effectively. Nominal or categorical data refers to a name, 
such as the species, male or female, and so on. Nominal data does not have a specific order or 
numeric value. Ordinal data has an intrinsic order, such as house numbers in a street, but is different 
from quantitative data in that it does not imply a mathematical interval. For example, it does not make 
much sense to multiply or divide house numbers. Quantitative data has a numeric value such as size or 
volume. Clearly, certain visual attributes are inappropriate for nominal data, such as size or position; 
they imply ordinal or quantitative information. 


Sometimes, it is not immediately clear what each data type in a particular dataset is. One way to 
disambiguate this is to find what operations are applicable for each data type. For example, when we 
are comparing nominal data, we can use equals, for instance, the species Whitefly is not equal to the 
species Thrip. However, we cannot use operations such as greater than or less than. It does not make 
sense to say, in an ordinal sense, that one species is greater than another. With ordinal data, we can 
apply operations such as greater than or less than. Ordinal data has an implicit order that we can map 
on a number line. Quantitative data may be measured on an interval scale, such as dates, to which we 
can apply additional operations such as subtraction. For example, we can not only say that a 
particular date occurs after another date, but we can also calculate the difference between the two 
dates. With quantitative data that has a fixed zero point, that is, data measured as a ratio of some 
fixed amount as opposed to an interval, we can use operations such as division. We can say that a 
particular object weighs twice as much or is twice as long as another object. 


Once we are clear on our data types, we can start mapping them to attributes. Here, we will consider 
six visual attributes. They are position, size, texture, color, orientation, and shape. Of these, only 
position and size can accurately represent all three types of data. Texture, color, orientation, and 
shape, on the other hand, can only accurately represent nominal data. We cannot say that one shape or 
color is greater than another. However, we can associate a particular color or texture with a name. 


Another thing to consider is the perceptual properties of these visual attributes. Research in 
psychology and psychophysics has established that visual attributes can be ranked in terms of how 
accurately they are perceived. Position is perceived most accurately, followed by length, angle, 
slope, area, volume, and finally, color and density, which are perceived with the least accuracy. It 
makes sense, therefore, to assign position and then length to the most important quantitative data. 
Finally, it should also be mentioned that we can encode, to some extent, ordinal data in a color's 
value (from dark to light) or continuous data in a color gradient. We cannot generally encode this 
data in a color's hue. For instance, there is no reason to perceive the color blue as somehow greater 
than the color red, unless you are making a reference to its frequency. 





The color gradient to represent ordinal data 


The next thing to consider is the number of dimensions that we need to display. For univariate data, 
that is, where we only need to display one variable, we have many choices such as dots, lines, or box 
plots. For bivariate data, where we need to display two dimensions, the most common choice is a 
scatter plot. For trivariate data, it is possible to use a 3D plot, and this can be useful for plotting 
geometric functions such as manifolds. However, 3D plots have some drawbacks for many data types. 
It can be a problem to work out relative distances on a 3D plot. For instance, in the following figure, 
it is difficult to gauge the exact positions of each element. However, if we encode the z dimension as 
size, the relative values become more apparent: 
Encoding Three Dimensions 


There is a large design space for encoding data into visual attributes. The challenge is to find the best 
mapping for our particular dataset and purpose. The starting point should be to encode the most 
important information in the most perceptually accurate way. Effective visual coding will depict all 
the data and not imply anything that is not in the data. For example, length implies quantitative data, so 
encoding non-quantitative data into length is incorrect. Another aspect to consider is consistency. We 
should choose attributes that make the most sense for each data type and use consistent and 
well-defined visual styles. 
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Summary 


You have learned that there are a large number of data sources, formats, and structures. You have 
hopefully gained some understanding of how to begin working with some of them. It is important to 
point out that in any machine learning project, working with the data at this fundamental level can 
comprise a significant proportion of the overall project development time. 


In the next chapter, we will look at how we can put our data to work by exploring the most common 
machine learning models. 
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Chapter 4. Models — Learning from 
Information 


So far in this book, we have examined a range of tasks and techniques. We introduced the basics of 
data types, structures, and properties, and we familiarized ourselves with some of the machine 
learning tools that are available. 


In this chapter, we will look at three broad types of model: 


• Logical models 
• Tree models 
• Rule models 


The next chapter will be devoted to another important type of model, the linear model. Much of the 
material in this chapter is theoretical, and its purpose is to introduce some of the mathematical and 
logical tools needed for machine learning tasks. I encourage you to work through these ideas and 
formulate them in ways that may help solve problems that we come across. 
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Logical models 


Logical models divide the instance space, that is, the set of all possible or allowable instances, into 
segments. The goal is to ensure that the data in each segment is homogeneous with respect to a 
particular task. For example, if the task is classification, then we aim to ensure that each segment 
contains a majority of instances of the same class. 


Logical models use logical expressions to explain a particular concept. The simplest and most 
general logical expressions are literals, and the most common of these is equality. The equality 
expression can be applied to all types: nominal, numerical, and ordinal. For numerical and 
ordinal types, we can include the inequality literals: greater than or less than. From here, we can 
build more complex expressions using four logical connectives. These are conjunction (logical 
AND), which is denoted by ∧; disjunction (logical OR), which is denoted by ∨; implication, which 
is denoted by →; and negation, which is denoted by ¬. This provides us with a way to express the 
following equivalences: 


¬¬A ≡ A          A → B ≡ ¬A ∨ B 
¬(A ∧ B) ≡ ¬A ∨ ¬B          ¬(A ∨ B) ≡ ¬A ∧ ¬B 
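

The standard equivalences between these connectives (double negation, the rewriting of implication as a disjunction, and De Morgan's laws) can be checked by brute force over every truth assignment. The following is a minimal sketch; the `implies` helper and the `checks` dictionary are names introduced here purely for illustration:

```python
from itertools import product


def implies(a: bool, b: bool) -> bool:
    """Material implication A -> B: false only when A is true and B is false."""
    return not (a and not b)


# Each entry compares the two sides of one equivalence for given truth values.
checks = {
    "double negation": lambda a, b: (not (not a)) == a,
    "implication": lambda a, b: implies(a, b) == ((not a) or b),
    "De Morgan (and)": lambda a, b: (not (a and b)) == ((not a) or (not b)),
    "De Morgan (or)": lambda a, b: (not (a or b)) == ((not a) and (not b)),
}

# An equivalence holds if both sides agree on all four truth assignments.
results = {
    name: all(check(a, b) for a, b in product([False, True], repeat=2))
    for name, check in checks.items()
}
print(results)
```

Because there are only two propositional variables, four assignments are enough to establish each equivalence.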


We can apply these ideas in a simple example. Let's say you come across a grove of trees that all 
appear to be from the same species. Our goal is to identify the defining features of this tree species 
for use in a classification task. For simplicity's sake, let's say we are just dealing with the following 
four features: 


• Size: This has three values (small, medium, and large) 
• Leaf type: This has two values (scaled or non-scaled) 
• Fruit: This has two values (yes or no) 
• Buttress: This has two values (yes or no) 


The first tree we identify can be described by the following conjunction: 

Size = Large ∧ Leaf = Scaled ∧ Fruit = No ∧ Buttress = Yes 


The next tree that we come across is medium-sized. If we drop the size condition, then the statement 
becomes more general. That is, it will cover more samples: 


Leaf = Scaled ∧ Fruit = No ∧ Buttress = Yes 


The next tree is also medium-sized, but it does not have buttresses, so we remove this condition and 
generalize it to the following: 

Leaf = Scaled ∧ Fruit = No 


The trees in the grove all satisfy this conjunction, and we conclude that they are conifers. Obviously, 
in a real-world example, we would use a greater range of features and values and employ more 
complex logical constructs. However, even in this simple example, the instance space is 3 × 2 × 2 × 2, 
which makes 24 possible instances. If we consider the absence of a feature as an additional value, 
then the hypothesis space, that is, the space that we can use to describe this set, is 4 × 3 × 3 × 3 = 108. The 
number of sets of instances, or extensions, that are possible is 2^24. For example, if you were to 
randomly choose a set of instances, the odds that you could find a conjunctive concept that exactly 
describes them are well over 100,000 to one. 
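

The arithmetic behind these counts can be reproduced in a few lines. This is a small sketch; the dictionary `feature_values` and the variable names are introduced here for illustration only:

```python
from math import prod

# Number of values per feature: size, leaf type, fruit, buttress
feature_values = {"size": 3, "leaf": 2, "fruit": 2, "buttress": 2}

# Every combination of feature values is one possible instance.
instance_space = prod(feature_values.values())

# Allowing "feature absent" adds one extra option per feature.
hypothesis_space = prod(n + 1 for n in feature_values.values())

# Every subset of the instance space is a possible extension.
extensions = 2 ** instance_space

# Rough odds that a randomly chosen set matches some conjunctive concept.
odds = extensions / hypothesis_space

print(instance_space, hypothesis_space, extensions, round(odds))
```

The ratio of extensions to hypotheses is what lies behind the "well over 100,000 to one" figure.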
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Generality ordering 


We can begin to map this hypothesis space from the most general statements to the most specific 
statements. For example, in the neighborhood of our conifer hypothesis, the space looks like this: 


[Figure: part of the hypothesis space around the conifer hypothesis, ordered by generality. Single-literal hypotheses such as Size = L, Scale = Y, Fruit = N, and Butt = Y are at the top, followed by pairs and triples of literals, with the full conjunction Size = L ∧ Scale = Y ∧ Fruit = N ∧ Butt = Y at the bottom] 





Here, we are ordering our hypotheses by generality. At the top is the most general hypothesis: all 
trees are conifers. A more general hypothesis will cover a greater number of instances, and 
therefore the most general hypothesis, that is, all trees are conifers, applies to all instances. Now, 
while this might apply to the grove we are standing in, when we attempt to apply this hypothesis to 
new data, that is, to trees outside the grove, it will fail. At the bottom of the preceding diagram, we 
have the least general hypothesis. As we make more observations and move up through the nodes, we 
can eliminate hypotheses and establish the next most general complete hypothesis. The most 
conservative generalization we can make from the data is called the least general generalization 
(LGG) of these instances. We can understand this as being the point in the hypothesis space where the 
paths upward from each of the instances intersect. 
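

For purely conjunctive hypotheses, the LGG can be computed by keeping only those feature values on which every observed instance agrees. The following is a minimal sketch under that assumption; the function name `lgg` and the dictionary representation of a tree are choices made here for illustration:

```python
def lgg(instances):
    """Least general generalization of instances given as feature dicts.

    A feature is kept only if every instance agrees on its value.
    """
    first, *rest = instances
    return {
        feature: value
        for feature, value in first.items()
        if all(other[feature] == value for other in rest)
    }


# The three trees observed in the grove, in the order described in the text
trees = [
    {"size": "large", "leaf": "scaled", "fruit": "no", "buttress": "yes"},
    {"size": "medium", "leaf": "scaled", "fruit": "no", "buttress": "yes"},
    {"size": "medium", "leaf": "scaled", "fruit": "no", "buttress": "no"},
]

print(lgg(trees[:2]))  # after two observations
print(lgg(trees))      # after all three observations
```

Each new observation can only remove conditions, which mirrors the way we moved up through the hypothesis space as we walked through the grove.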


Let's describe our observations in a table: 
Sooner or later, of course, you wander out of the grove and you observe negative examples, that is, trees 
that are clearly not conifers. You note the following features: 





So, with the addition of the negative examples, we can see that our least general complete 
hypothesis is still Scale = Y ∧ Fruit = N. However, you will notice that a negative example, n4, is 
covered. The hypothesis is therefore not consistent. 
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Version space 


This simple example may lead you to the conclusion that there is only one LGG. But this is not 
necessarily true. We can expand our hypothesis space by adding a restricted form of disjunction 
called internal disjunction. In our previous example, we had three positive examples of conifers 
with either medium or large size. We can add a condition Size = Medium ∨ Size = Large, and we 
can write this as Size[m,l]. Internal disjunction will only work with features that have more than two 
values because something like Leaves = Scaled ∨ Leaves = Non-Scaled is always true. 


In the previous conifer example, we dropped the size condition to accommodate our second and third 
observations. This gave us the following LGG: 


Leaf = Scaled ∧ Fruit = No 

Given our internal disjunction, we can rewrite the preceding LGG as follows: 
Size[m,l] ∧ Leaf = Scaled ∧ Fruit = No 

Now, consider the first non-conifer, that is, the first negative example: 

Size = Small ∧ Leaf = Non-scaled ∧ Fruit = No 


We can drop any of the three conditions in the LGG with the internal disjunction without covering this 
negative example. However, when we attempt to generalize further to single conditions, we see that 
Size[m,l] and Leaf = Scaled are OK but Fruit = No is not, since it covers the negative example. 


Now, we are interested in the hypothesis that is both complete and consistent, that is, it covers all 
the positive examples and none of the negative. Let's now redraw our diagram considering just our 
four positive (p1 to p4) examples and one negative example (n1). 


[Figure: the version space. The two most general hypotheses are Size = [m,l] and Scale = Y; the three intermediate hypotheses are Size = [m,l] ∧ Fruit = N, Size = [m,l] ∧ Scale = Y, and Scale = Y ∧ Fruit = N; the least general hypothesis is Size = [m,l] ∧ Scale = Y ∧ Fruit = N] 





This is sometimes referred to as the version space. Note that we have one least general hypothesis, 
three intermediate, and, now, two most general hypotheses. The version space forms a convex set. 
This means we can interpolate between members of this set. If an element lies between a most general 
and least general member of the set, then it is also a member of that set. In this way, we can fully 
describe the version space by its most and least general members. 


Consider a case where the least general generalization covers one or more of the negative instances. 
In such cases, we can say that the data is not conjunctively separable and the version space is empty. 
We can apply a different approach whereby we search for the most general consistent hypothesis. Here 
we are interested in consistency as opposed to completeness. This essentially involves iterating 
through paths in the hypothesis space from the most general. We take downward steps by, for 
example, adding a conjunct or removing a value from an internal disjunction. At each step, we minimize 
the specialization of the resulting hypothesis. 
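

For a hypothesis space this small, the version space can be found by exhaustive enumeration. The sketch below assumes the numeric encoding used in the decision tree listing later in this chapter (size: 0 = small, 1 = medium, 2 = large; other features: 0 = no, 1 = yes) and represents each condition as the set of values it allows, so that a full set means the feature is unconstrained. The function and variable names are introduced here for illustration:

```python
from itertools import combinations, product

FEATURES = ("size", "scale", "fruit", "butt")
positives = [(2, 1, 0, 1), (1, 1, 0, 1), (1, 1, 0, 0), (1, 1, 0, 0)]  # p1..p4
negatives = [(0, 0, 0, 0)]  # n1


def allowed_value_sets(values):
    """All non-empty subsets of a feature's values (internal disjunctions).

    The full set is equivalent to dropping the condition on that feature.
    """
    return [frozenset(c) for r in range(1, len(values) + 1)
            for c in combinations(values, r)]


def covers(hypothesis, instance):
    """A hypothesis covers an instance if every feature value is allowed."""
    return all(value in allowed for allowed, value in zip(hypothesis, instance))


domains = [(0, 1, 2), (0, 1), (0, 1), (0, 1)]
hypotheses = list(product(*(allowed_value_sets(d) for d in domains)))

version_space = [
    h for h in hypotheses
    if all(covers(h, p) for p in positives)       # complete
    and not any(covers(h, n) for n in negatives)  # consistent
]

# Print from least general (fewest allowed values) to most general.
for h in sorted(version_space, key=lambda h: sum(len(s) for s in h)):
    print({f: sorted(s) for f, s in zip(FEATURES, h)})
```

Enumeration like this is only practical for toy problems; it is shown here to make the definitions of completeness and consistency concrete.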
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Coverage space 


When our data is not conjunctively separable, we need a way to optimize between consistency and 
completeness. A useful approach is to map the coverage space of positive and negative 
instances, as shown in the following diagram: 
We can see that learning a hypothesis involves finding a path through the hypothesis space ordered by 
generality. Logical models involve finding a pathway through a lattice-structured hypothesis space. 
Each hypothesis in this space covers a set of instances. Each of these sets has upper and lower 
bounds in, and is ordered by, generality. So far, we have only used single conjunctions of literals. 
With a rich logical language at our disposal, why not incorporate a variety of logical connectives into 
our expressions? There are basically two reasons why we may want to keep our expressions simple, 
as follows: 


• More expressive statements lead to specialization, which will result in a model overfitting 
training data and performing poorly on test data 
• Complicated descriptions are computationally more expensive than simple descriptions 


As we saw when learning about the conjunctive hypothesis, uncovered positive examples allow us to 
drop literals from the conjunction, making it more general. On the other hand, covered negative 
examples require us to increase specialization by adding literals. 
Rather than describing each hypothesis in terms of conjunctions of single literals, we can describe it 
in terms of disjunctions of clauses, where each clause can be of the form A → B. Here, A is a 
conjunction of literals and B is a single literal. Let's consider the following statement that covers a 
negative example: 


Butt = Y ∧ Scaled = N ∧ Size = S ∧ ¬(Fruit = N) 
To exclude this negative example, we can write the following clause: 
Butt = Y ∧ Scaled = N ∧ Size = S → Fruit = N 


There are, of course, other clauses that exclude the negative, such as Butt = Y → Fruit = N; however, 
we are interested in the most specific clause because it is less likely to also exclude covered 
positives. 
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PAC learning and computational complexity 


Given that, as we increase the complexity of our logical language, we impose a computational cost, 
we need a metric to gauge the learnability of a language. To this end, we can use the idea of 
Probably Approximately Correct (PAC) learning. 


When we select one hypothesis from a set of hypotheses, the goal is to ensure that our selection will 
have, with high probability, a low generalization error, that is, it will perform with a high degree of 
accuracy on a test set. This introduces the idea of computational complexity, a formalization 
that gauges the computational cost of a given algorithm in relation to the accuracy of its output. 


PAC learning makes allowance for mistakes on non-typical examples, and this typicality is 
determined by an unspecified probability distribution, D. We can evaluate the error rate of a 
hypothesis with respect to this distribution. For example, let's assume that our data is noise-free and 
that the learner always outputs a hypothesis that is complete and consistent on the training samples. Let's 
choose an arbitrary error rate $\epsilon < 0.5$ and a failure rate $\delta = 0.5$. We require our learning algorithm to 
output, with probability at least $1 - \delta$, a hypothesis whose error rate is less than $\epsilon$. It turns out 
that this will always be true for any reasonably sized training set. For example, if our hypothesis 
space, H, contains a single bad hypothesis, then the probability that it is complete and consistent on n 
independent training samples is less than or equal to $(1 - \epsilon)^n$. For any $0 < \epsilon < 1$, this probability is 
less than $e^{-n\epsilon}$. We need to keep this below our failure rate, $\delta$, which we achieve by setting 
$n \geq \frac{1}{\epsilon} \ln \frac{1}{\delta}$. Now, if H contains a number of bad hypotheses, $k \leq |H|$, then the probability that at least one of 
them is complete and consistent on n independent samples is at maximum: 


$$k(1 - \epsilon)^n \leq |H|(1 - \epsilon)^n \leq |H| e^{-n\epsilon}$$

This maximum will be less than $\delta$ if the following condition is met: 


$$n \geq \frac{1}{\epsilon}\left(\ln |H| + \ln \frac{1}{\delta}\right)$$


This is known as the sample complexity and you will notice that it is logarithmic in $1/\delta$ and linear in 
$1/\epsilon$. 


Note 


This implies that it is exponentially cheaper to reduce the failure rate than it is to reduce the error 
rate. 
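

To get a feel for these numbers, the sample complexity bound for a finite hypothesis space, n ≥ (ln|H| + ln(1/δ)) / ε, can be evaluated directly. The sketch below is illustrative only: the function name `sample_complexity` is introduced here, and |H| = 108 is borrowed from the conifer hypothesis space earlier in the chapter:

```python
from math import ceil, log


def sample_complexity(hypothesis_count: int, epsilon: float, delta: float) -> int:
    """Smallest integer n with n >= (ln|H| + ln(1/delta)) / epsilon."""
    return ceil((log(hypothesis_count) + log(1.0 / delta)) / epsilon)


# Compare the effect of tightening the failure rate with tightening the error rate.
for epsilon, delta in [(0.1, 0.05), (0.1, 0.005), (0.01, 0.05)]:
    print(epsilon, delta, sample_complexity(108, epsilon, delta))
```

Dividing the failure rate by ten adds only a modest number of samples, whereas dividing the error rate by ten multiplies the requirement roughly tenfold.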


To conclude this section, I will make one further point. The hypothesis space H is a subset of U, a 
universe of explanations for any given phenomenon. How do we know whether the correct hypothesis 
actually exists inside H rather than elsewhere in U? Bayes' theorem shows a relationship between the 
relative probabilities of H and ¬H as well as their relative prior probabilities. However, there is no 
real way we can know the value of P(¬H) because there is no way to calculate the probabilities of a 
hypothesis that has not yet been conceived. Moreover, the contents of this hypothesis consist of a 
currently unknown universe of possible objects. This paradox occurs in any description that uses 
comparative hypothesis checking where we evaluate our current hypothesis against other hypotheses 
within H. Another approach would be to find a way to evaluate H. We can see that, as we expand H, 
the computability of hypotheses within it becomes more difficult. To evaluate H, we need to restrict 
our universe to the universe of the known. For a human, this is a life of experiences that has been 
imprinted in our brains and nervous system; for a machine, it is the memory banks and algorithms. The 
ability to evaluate this global hypothesis space is one of the key challenges of artificial intelligence. 
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Tree models 


Tree models are ubiquitous in machine learning. They are naturally suited to divide and conquer 
iterative algorithms. One of the main advantages of decision tree models is that they are naturally easy 
to visualize and conceptualize. They allow inspection and do not just give an answer. For example, if 
we have to predict a category, we can also expose the logical steps that give rise to a particular 
result. Also, tree models generally require less data preparation than other models and can handle 
numerical and categorical data. On the down side, tree models can create overly complex models that 
do not generalize to new data very well. Another potential problem with tree models is that they can 
become very sensitive to changes in the input data and, as we will see later, this problem can be 
mitigated by using them as ensemble learners. 


An important difference between decision trees and the hypothesis mapping used in the previous 
section is that the tree model does not use internal disjunction on features with more than two values 
but instead branches on each value. We can see this with the size feature in the following diagram: 


[Figure: a decision tree for the conifer example, branching on Scaled, Fruit, Size, and Butt, with the number of positive and negative examples shown at each node] 


Another point to note is that decision trees are more expressive than the conjunctive hypothesis and 
we can see this here, where we have been able to separate the data where the conjunctive hypothesis 
covered negative examples. This expressiveness, of course, comes with a price: the tendency to 
overfit on training data. A way to force generalization and reduce overfitting is to introduce an 
inductive bias toward less complex hypotheses. 


We can quite easily implement our little example using the scikit-learn DecisionTreeClassifier and 
create an image of the resultant tree: 


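import os

from sklearn import tree

# Feature names, in the order used in each row below
names = ['size', 'scale', 'fruit', 'butt']
# Labels exactly as given in the source, one per row of data. Note that five
# rows are labeled 1, so n1 is treated as a positive example here.
labels = [1, 1, 1, 1, 1, 0, 0, 0]

# Inferred encoding: size 0 = small, 1 = medium, 2 = large; others 0 = no, 1 = yes
p1 = [2, 1, 0, 1]
p2 = [1, 1, 0, 1]
p3 = [1, 1, 0, 0]
p4 = [1, 1, 0, 0]
n1 = [0, 0, 0, 0]
n2 = [1, 0, 0, 0]
n3 = [0, 0, 1, 0]
n4 = [1, 1, 0, 0]
data = [p1, p2, p3, p4, n1, n2, n3, n4]


def pred(test, data=data):
    # Fit a decision tree on the training data
    dtre = tree.DecisionTreeClassifier()
    dtre = dtre.fit(data, labels)
    # Predict the class of a single test instance
    print(dtre.predict([test]))
    # Export the fitted tree in Graphviz .dot format
    os.makedirs('data', exist_ok=True)
    with open('data/treeDemo.dot', 'w') as f:
        tree.export_graphviz(dtre, out_file=f, feature_names=names)


pred([1, 1, 0, 1])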


Running the preceding code creates a treeDemo.dot file. The decision tree classifier, saved as a 
.dot file, can be converted into an image file such as a .png, .jpeg or .gif using the Graphviz graph 
visualization software. You can download Graphviz from http://graphviz.org/Download.php. Once 
you have it installed, use it to convert the .dot file into an image file format of your choice. 


This gives you a clear picture of how the decision tree has been split. 
[Figure: the tree exported by scikit-learn. The root node splits on fruit <= 0.5 (gini = 0.46875, samples = 8), followed by splits on butt <= 0.5 (gini = 0.408163265306, samples = 7) and size <= 0.5, with the Gini value, sample count, and class counts shown at each node] 





We can see from the full tree that we recursively split on each node, increasing the proportion of 
samples of the same class with each split. We continue down nodes until we reach a leaf node where 
we aim to have a homogeneous set of instances. This notion of purity is an important one because it 
determines how each node is split and it is behind the Gini values in the preceding diagram. 
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Purity 


How do we understand the usefulness of each feature in relation to being able to split samples into 
classes that contain minimal or no samples from other classes? What are the indicative sets of 
features that give a class its label? To answer this, we need to consider the idea of purity of a split. 
For example, suppose we have a Boolean feature, so that a set of instances D is split into $D_1$ and $D_2$. If we 
further restrict ourselves to just two classes, $D^{pos}$ and $D^{neg}$, we can see that the optimum situation is 
where D is split perfectly into positive and negative examples. There are two possibilities for this: 
either $D_1^{pos} = D^{pos}$ and $D_1^{neg} = \emptyset$, or $D_1^{neg} = D^{neg}$ and $D_1^{pos} = \emptyset$. 


If this is true, then the children of the split are said to be pure. We can measure the impurity of a split 
by the relative magnitude of $n^{pos}$ and $n^{neg}$. This is captured by the empirical probability of the positive class, 
which can be defined by the proportion $p = n^{pos} / (n^{pos} + n^{neg})$. There are several requirements for an 
impurity function. First, if we switch the positive and negative class (that is, replace p with 1 - p) then 
the impurity should not change. Also, the function should be zero when p = 0 or p = 1, and it should 
reach its maximum when p = 0.5. In order to split each node in a meaningful way, we need an 
optimization function with these characteristics. 


There are three functions that are typically used for impurity measures, or splitting criteria, with the 
following properties. 


• Minority class: This is simply a measure of the proportion of misclassified examples assuming 
we label each leaf with the majority class. The higher this proportion is, the greater the number 
of errors and the greater the impurity of the split. This is sometimes called the classification 
error, and is calculated as min(p, 1 - p). 

• Gini index: This is the expected error if we label examples either positive, with probability p, 
or negative, with probability 1 - p, which works out to 2p(1 - p). Sometimes, the square root of the Gini index is used as well, 
and this can have some advantages when dealing with highly skewed data where a large 
proportion of samples belongs to one class. 

• Entropy: This measure of impurity is based on the expected information content of the split. 
Consider a message telling you about the class of a series of randomly drawn samples. The 
purer the set of samples, the more predictable this message becomes, and therefore the smaller 
the expected information. Entropy is measured by the following formula: 


$$-p \log_2 p - (1 - p) \log_2 (1 - p)$$


These three splitting criteria, for a probability range between 0 and 1, are plotted in the following 
diagram. The entropy criterion is scaled by 0.5 to enable it to be compared to the other two. We 
can use the output from the decision tree to see where each node lies on this curve. 
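

The three impurity measures are simple enough to write down as functions of p. The following sketch uses the two-class forms min(p, 1 - p), 2p(1 - p), and the binary entropy; the function names are introduced here for illustration. As a check, the Gini value of the root node in the scikit-learn output can be reproduced from the code listing, where five of the eight labels are 1:

```python
from math import log2


def minority_class(p: float) -> float:
    """Proportion of errors when labeling with the majority class."""
    return min(p, 1 - p)


def gini(p: float) -> float:
    """Two-class Gini index, 2p(1 - p)."""
    return 2 * p * (1 - p)


def entropy(p: float) -> float:
    """Binary entropy in bits; defined as 0 at p = 0 and p = 1."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


# Root node of the scikit-learn tree: 8 samples, 5 of them labeled 1
print(gini(5 / 8))

# Entropy is scaled by 0.5 so that all three measures peak at 0.5
for p in (0.0, 0.1, 0.25, 0.5):
    print(p, minority_class(p), gini(p), 0.5 * entropy(p))
```

All three functions are zero for a pure node, symmetric around p = 0.5, and largest at p = 0.5, which are the requirements listed for an impurity function.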
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Rule models 


We can best understand rule models using the principles of discrete mathematics. Let's review some 
of these principles. 


Let X be a set of features, the feature space, and C be a set of classes. We can define the ideal 
classifier for X as follows: 


$$c: X \rightarrow C$$

A set of examples in the feature space with class c is defined as follows: 

$$D = \{(x_1, c(x_1)), \ldots, (x_n, c(x_n))\} \subseteq X \times C$$


A splitting of X is a partitioning of X into a set of mutually exclusive subsets $X_1, \ldots, X_s$, so we can say the 
following: 


$$X = X_1 \cup \ldots \cup X_s$$


This induces a splitting of D into $D_1, \ldots, D_s$. We define $D_j$, where $j = 1, \ldots, s$, as 
$\{(x, c(x)) \in D \mid x \in X_j\}$. 


This is just defining a subset in X called $X_j$ where all the members of $X_j$ are perfectly classified. 


In the following table we define a number of measurements using sums of indicator functions. An 
indicator function uses the notation $I[\ldots]$, which is equal to one if the statement between the square 
brackets is true and zero if it is false. Here $\hat{c}(x)$ is the estimate of $c(x)$. 


Let's take a look at the following table: 


Measure | Definition 
Number of positives | $P = \sum_{x \in D} I[c(x) = pos]$ 
Number of negatives | $N = \sum_{x \in D} I[c(x) = neg]$ 
True positives | $TP = \sum_{x \in D} I[\hat{c}(x) = c(x) = pos]$ 
Error rate | $err = \frac{1}{|D|} \sum_{x \in D} I[\hat{c}(x) \neq c(x)]$ 
True positive rate (sensitivity, recall) | $tpr = \frac{\sum_{x \in D} I[\hat{c}(x) = c(x) = pos]}{\sum_{x \in D} I[c(x) = pos]} = \frac{TP}{P}$ 
True negative rate (negative recall) | $tnr = \frac{\sum_{x \in D} I[\hat{c}(x) = c(x) = neg]}{\sum_{x \in D} I[c(x) = neg]}$ 
Precision, confidence | $prec = \frac{\sum_{x \in D} I[\hat{c}(x) = c(x) = pos]}{\sum_{x \in D} I[\hat{c}(x) = pos]} = \frac{TP}{TP + FP}$ 
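

These measurements translate almost literally into code when each one is written as a sum of indicator values. The sketch below uses a small set of hypothetical true classes and predictions (they are not data from the text), and the names `indicator`, `actual`, and `predicted` are introduced here for illustration:

```python
def indicator(statement: bool) -> int:
    """I[statement]: 1 if the statement is true, 0 otherwise."""
    return int(statement)


# Hypothetical true classes c(x) and estimates c_hat(x) for six examples
actual = ["pos", "pos", "pos", "neg", "neg", "neg"]
predicted = ["pos", "pos", "neg", "neg", "neg", "pos"]
pairs = list(zip(actual, predicted))

# Counts, each written as a sum of indicator functions over the examples
P = sum(indicator(c == "pos") for c, _ in pairs)
N = sum(indicator(c == "neg") for c, _ in pairs)
TP = sum(indicator(c_hat == c == "pos") for c, c_hat in pairs)
TN = sum(indicator(c_hat == c == "neg") for c, c_hat in pairs)
FP = sum(indicator(c_hat == "pos" and c == "neg") for c, c_hat in pairs)

# Rates derived from the counts
tpr = TP / P            # true positive rate (sensitivity, recall)
tnr = TN / N            # true negative rate (negative recall)
prec = TP / (TP + FP)   # precision (confidence)
err = sum(indicator(c_hat != c) for c, c_hat in pairs) / len(pairs)

print(P, N, TP, TN, FP, tpr, tnr, prec, err)
```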





Rule models comprise not only sets or lists of rules, but importantly, a specification on how to 
combine these rules to form predictions. They are a logical model but differ from the tree approach in 
that trees split into mutually exclusive branches, whereas rules can overlap, possibly carrying 
additional information. In supervised learning there are essentially two approaches to rule models. 
One is to find a combination of literals, as we did previously, to form a hypothesis that covers a 
sufficiently homogeneous set of samples, and then find a label. Alternatively, we can do the opposite; 
that is, we can first select a class and then find rules that cover a sufficiently large subset of samples 
of that class. The first approach tends to lead to an ordered list of rules, and in the second approach, 
rules are an unordered set. Each deals with overlapping rules in its own characteristic way, as we 
will see. Let's look at the ordered list approach first. 
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The ordered list approach 


As we add literals to a conjunctive rule, we aim to increase the homogeneity of each subsequent set 
of instances covered by the rule. This is similar to constructing a path in the hypothesis space as we 
did for our logical trees in the last section. A key difference with the rule approach is that we are only 
interested in the purity of one of the children, the one where the added literal is true. With tree-based 
models, we use the weighted average of both children to find the purity of both branches of a binary 
split. Here, we are still interested in calculating the purity of subsequent rules; however, we only 
follow one side of each split. We can still use the same methods for finding purity, but we no longer 
need to average over all children. As opposed to the divide and conquer strategy of decision trees, 
rule-based learning is often described as separate and conquer. 


Let's briefly consider an example using our conifer categorization problem from the previous section. 


[Figure: the number of positive and negative examples covered by each candidate literal: size=L, size=M, size=S, scaled=Y, scaled=N, fruit=Y, fruit=N, butt=N, and butt=Y] 





There are several options for choosing a rule that will result in the purest split. Supposing we choose 
the rule if scaled = N then class is negative, we have covered three out of four negative samples. In 
the next iteration, we remove these samples from consideration and continue this process of searching 
for literals with maximum purity. Effectively, what we are doing is building an ordered list of rules 
joined with the if and else clauses. We can rewrite our rules to be mutually exclusive, and this 
would mean that the set of rules does not need to be ordered. The tradeoff here is that we would have 
to use either negated literals or internal disjunctions to deal with features that have more than two 
values. 
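

The separate and conquer loop described above can be sketched in a few lines. This is a simplified illustration rather than a full rule learner: each rule is a single literal, purity is the proportion of the majority class among the covered examples, and ties are broken in favor of larger coverage. The rows reuse the numeric conifer encoding from the decision tree listing, but here the four p rows are labeled 1 and the four n rows 0, which is an assumption made to match the text's four negative samples. All function names are hypothetical:

```python
from collections import Counter

FEATURES = ("size", "scale", "fruit", "butt")
examples = [
    ((2, 1, 0, 1), 1), ((1, 1, 0, 1), 1), ((1, 1, 0, 0), 1), ((1, 1, 0, 0), 1),
    ((0, 0, 0, 0), 0), ((1, 0, 0, 0), 0), ((0, 0, 1, 0), 0), ((1, 1, 0, 0), 0),
]


def best_literal(remaining):
    """Return (purity, coverage, feature index, value, majority label)."""
    best = None
    for i in range(len(FEATURES)):
        for value in sorted({x[i] for x, _ in remaining}):
            labels = Counter(y for x, y in remaining if x[i] == value)
            label, count = labels.most_common(1)[0]
            covered = sum(labels.values())
            candidate = (count / covered, covered, i, value, label)
            # Prefer higher purity, then higher coverage
            if best is None or candidate[:2] > best[:2]:
                best = candidate
    return best


def learn_rule_list(data):
    rules, remaining = [], list(data)
    while remaining:
        purity, covered, i, value, label = best_literal(remaining)
        if covered == len(remaining):  # no literal separates what is left
            majority = Counter(y for _, y in remaining).most_common(1)[0][0]
            rules.append(("default", majority))
            break
        rules.append((f"{FEATURES[i]}={value}", label))
        # Separate: remove the covered examples, then conquer the rest
        remaining = [(x, y) for x, y in remaining if x[i] != value]
    return rules


for condition, label in learn_rule_list(examples):
    print(f"if {condition} then class {label}")
```

With this data, the first rule found corresponds to if scaled = N then class is negative, which covers three of the four negative samples. The final default rule acts as a simple stopping criterion for the examples that no single literal can separate.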


There are certain refinements we can make to this model. For example, we can introduce a stopping 
criterion that halts iteration if certain conditions are met, such as in the case of noisy data where we 
may want to stop iteration when the number of samples in each class falls below a certain number. 


Ordered rule models have a lot in common with decision trees, especially in that they use an 
objective function based on the notion of purity, that is, the relative number of positive and negative 
class instances in each split. They have structures that are easy to visualize and they are used in many 
different machine learning settings. 




Set-based rule models 


With set-based rule models, rules are learned one class at a time, and our objective function simply 
becomes maximizing p, rather than minimizing min(p, 1 - p). Algorithms that use this method typically 
iterate over each class and cover only samples of that class, which are removed after a rule is found. 
Set-based models use precision (refer to table 4-1) as a search heuristic and this can make the model 
focus too much on the purity of the rule; it may miss near pure rules that can be further specialized to 
form a pure rule. Another approach, called beam search, uses a heuristic to order a predetermined 
number of best partial solutions. 


Note 


Ordered lists give us a convex coverage for the training set. This is not necessarily true of the 
unordered set-based approach where there is no global optimum order for a given set of rules. 
Because of this, we have access to rule overlaps expressed as a conjunction A ∧ B, where A and B are 
two rule sets. If these two rules are in an ordered list, we have either, if the order is AB, A = (A ∧ B) 
∨ (A ∧ ¬B) or, if the order is BA, B = (A ∧ B) ∨ (¬A ∧ B). This means that the rule space is 
potentially enlarged; however, because we have to estimate the coverage of overlaps, we sacrifice 
convexity. 


Rule models, in general, are well suited to predictive models. We can, as we will see later, extend 
our rule models to perform such tasks as clustering and regression. Another important application of 
rule models is to build descriptive models. When we are building classification models, we 
generally look for rules that will create pure subsets of the training samples. However, this is not 
necessarily true if we are looking for other distinguishing characteristics of a particular sample set. 
This is sometimes referred to as subgroup discovery. Here, we are not interested in a heuristic that is 
based on class purity but rather in one that looks for distinguishing class distributions. This is done 
using a defined quality function based on the idea of local exceptional testing. This function can take 
the form q = TP/(FP + g). Here g is a generalization factor that determines the allowable number of 
nontarget class instances relative to the number of instances covered by the rule. For a small value of 
g, say less than 1, rules will be generated that are more specific because every additional nontarget 
example incurs greater relative expense. Higher values of g, say greater than 10, create more general 
rules covering more nontarget samples. There is no theoretical maximum value for g; however, it 
does not make much sense for it to exceed the number of samples. The value of g is governed by the 
size of the data and the proportion of positive samples. The value of g can be varied, thus guiding 
subgroup discovery to certain points in the TP versus FP space. 
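

The effect of the generalization factor can be seen by scoring two candidate rules with a quality function of the form TP / (FP + g). The sketch below uses hypothetical true and false positive counts, and the symbol q and the function name `quality` are labels chosen here for illustration:

```python
def quality(tp: int, fp: int, g: float) -> float:
    """Subgroup quality q = TP / (FP + g)."""
    return tp / (fp + g)


# Hypothetical candidate rules, described by their true and false positive counts
specific_rule = {"tp": 5, "fp": 0}
general_rule = {"tp": 20, "fp": 5}

for g in (0.5, 20):
    q_specific = quality(g=g, **specific_rule)
    q_general = quality(g=g, **general_rule)
    preferred = "specific" if q_specific > q_general else "general"
    print(g, round(q_specific, 3), round(q_general, 3), preferred)
```

With a small g the specific rule that covers no nontarget examples scores higher, while a large g favors the more general rule with greater coverage.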


We can use subjective or objective quality functions. We can incorporate subjective interestingness 
coefficients into the model to reflect things such as understandability, unexpectedness, or, based on 
templates describing the interesting class, relationship patterns. Objective measurements are derived 
from the statistical and structural properties of the data itself. They are very amenable to the use of 
coverage plots to highlight subgroups that have statistical properties that differ from the population as
a whole.
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Finally, in this section on rule-based models, we will consider rules that can be learned entirely 
unsupervised. This is called association rule learning, and its typical use cases include data mining, 
recommender systems and natural language processing. We will use as an example a hardware shop 
that sells four items: hammers, nails, screws, and paint. 


Let's take a look at the following table: 


Hammers and nails 
Hammers, nails, paint, and screws





In this table, we have grouped transactions with items. We could also have grouped each item with
the transactions it was involved in. For example, nails were involved in transactions 1, 2, 3, 4, and 7,
and hammers were involved in 2, 3, 4, and so on. We can also do this with sets of items, for example,
hammers and nails were both involved in transactions 2, 3, and 4. We can write this as the item set
{hammer, nails} covers the transaction set [2, 3, 4]. There are 16 item sets including the empty set,
which covers all transactions.


The relationship between transaction sets forms a lattice structure connecting items with their
respective sets. In order to build association rules, we need to create frequent item sets that exceed
the threshold $F_T$. For example, the frequent item sets where $F_T = 3$ include {screws}, {hammer, nails},
and {paint}. These are simply the item sets that are associated with three or more transactions. The
following is a diagram showing part of the lattice from our example. In a similar way, we found the
least general generalization in our hypothesis space mapping. Here, we are interested in the lowest
boundary of the largest item set. In this example, it is {nails, hammer}.


[Figure: part of the item set lattice; the bottom node, {nails, hammer, paint, screws}, has frequency f = 1]


We can now create association rules of the form if A then B, where A and B are item sets that
frequently appear together in a transaction. If we select an edge on this diagram, say the edge between
{nails} with a frequency of 5, and {nails, hammer} with a frequency of 3, then we can say that the
confidence of the association rule if nails then hammer is 3/5. Using a frequency threshold together
with the confidence of a rule, an algorithm can find all rules that exceed this threshold. This is called
association rule mining, and it often includes a post-processing phase where unnecessary rules are
filtered out, for example, where a more specific rule does not have a higher confidence than a more
general parent.
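

The counting behind these statements can be sketched in a few lines of Python. The transaction
table below is an assumption: it is a hypothetical set of eight transactions chosen only so that
it agrees with the counts quoted in the text (nails in transactions 1, 2, 3, 4, and 7, hammers in
2, 3, and 4, and so on). The helper names cover, frequent_item_sets, and confidence are likewise
made up for this illustration.

```python
from itertools import combinations

# Hypothetical transactions, chosen only to agree with the counts in the text.
transactions = {
    1: {"nails"},
    2: {"hammer", "nails"},
    3: {"hammer", "nails", "paint", "screws"},
    4: {"hammer", "nails", "paint"},
    5: {"screws"},
    6: {"paint", "screws"},
    7: {"screws", "nails"},
    8: {"paint"},
}


def cover(item_set):
    """Return the sorted IDs of the transactions containing every item in item_set."""
    return sorted(t for t, items in transactions.items() if item_set <= items)


def frequent_item_sets(threshold):
    """Enumerate all non-empty item sets that cover at least `threshold` transactions."""
    items = sorted(set().union(*transactions.values()))
    found = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            freq = len(cover(set(combo)))
            if freq >= threshold:
                found[combo] = freq
    return found


def confidence(antecedent, consequent):
    """Confidence of the rule 'if antecedent then consequent'."""
    return len(cover(antecedent | consequent)) / len(cover(antecedent))


print(cover({"hammer", "nails"}))
print(frequent_item_sets(3))
print(confidence({"nails"}, {"hammer"}))
```

A brute-force enumeration like this is only practical for a handful of items; it is meant to make
the definitions concrete rather than to stand in for a real association rule mining algorithm.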
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Summary 


We began this chapter by exploring a logical language and creating a hypothesis space mapping for a 
simple example. We discussed the idea of least general generalizations and how to find a path through 
this space from the most general to the least general hypothesis. We briefly looked at the concept of 
learnability. Next, we looked at tree models and found that they can be applied to a wide range of 
tasks and are both descriptive and easy to interpret. Trees by themselves, however, are prone to 
overfitting and the greedy algorithms employed by most tree models can be prone to over-sensitivity 
to initial conditions. Finally, we discussed both ordered rule lists and unordered rule set-based 
models. The two different rule models are distinguished by how they handle rule overlaps. The 
ordered approach is to find a combination of literals that will separate the samples into more
homogeneous groups. The unordered approach searches for a hypothesis one class at a time.


In the next chapter, we will look at quite a different type of model—the linear model. These models 
employ the mathematics of geometry to describe the problem space and, as we will see, form the 
basis for support vector machines and neural nets. 
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Chapter 5. Linear Models 


Linear models are one of the most widely used models and form the foundation of many advanced 
nonlinear techniques such as support vector machines and neural networks. They can be applied to 
any predictive task such as classification, regression, or probability estimation. 


When responding to small changes in the input data, and provided that our data consists of entirely 
uncorrelated features, linear models tend to be more stable than tree models. As we mentioned in the 
last chapter, tree models can over-respond to small variations in training data. This is because splits 
at the root of a tree have consequences that are not recoverable further down the line, that is, 
producing different branching and potentially making the rest of the tree significantly different. Linear 
models, on the other hand, are relatively stable, being less sensitive to initial conditions. However, as
you would expect, this comes at a price: they are also less sensitive to nuances in the data. This is
described by the terms variance (for overfitting models) and bias (for underfitting models). A linear
model is typically low-variance and high-bias.


Linear models are generally best approached from a geometric perspective. We know we can easily 
plot two dimensions of space in a Cartesian co-ordinate system, and we can use the illusion of 
perspective to illustrate a third. We have also been taught to think of time as being a fourth dimension, 
but when we start speaking of n dimensions, a physical analogy breaks down. Intriguingly, we can 
still use many of the mathematical tools that we intuitively apply to three dimensions of space. While 
it becomes difficult to visualize these extra dimensions, we can still use the same geometric concepts, 
such as lines, planes, angles, and distance, to describe them. With geometric models, we describe 
each instance as having a set of real-value features, each of which is a dimension in our geometric 
space. Let's begin this chapter with a review of the formalism associated with linear models. 


We have already discussed the basic numerical linear model solution by the least squares method for
two variables. It is straightforward and easy to visualize on a 2D coordinate system. When we try to
add parameters, as we add features to our model, we need a formalism to replace, or augment, an
intuitive visual representation. In this chapter, we will be looking at the following topics:


The least squares method 
The normal equation method 
Logistic regression 
Regularization 


Let's start with the basic model. 
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Introducing least squares 


In a simple one-feature model, our hypothesis function is as follows:


$$h_w(x) = w_0 + w_1 x$$


If we graph this, we can see that it is a straight line crossing the y axis at $w_0$ and having a slope of $w_1$.


The aim of a linear model is to find the parameter values that will create a straight line that most
closely matches the data. We call these the function's parameter values. We define an objective
function, $J_w$, which we want to minimize:


$$\min_w J_w = \frac{1}{2m} \sum_{i=1}^{m} \left(h_w(x^{(i)}) - y^{(i)}\right)^2$$


Here, m is the number of training samples, $h_w(x^{(i)})$ is the estimated value of the ith training sample,
and $y^{(i)}$ is its actual value. This is the cost function of h, because it measures the cost of the error; the
greater the error, the higher the cost. This method of deriving the cost function is sometimes referred to
as the sum of the squared error because it sums up the squared difference between the predicted value
and the actual value. This sum is halved as a convenience, as we will see. There are actually two ways
that we can solve this. We can either use an iterative gradient descent algorithm or minimize the cost
function in one step using the normal equation. We will look at the gradient descent first.
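

To make the cost function concrete, here is a minimal sketch that evaluates it for a single-feature
hypothesis. The function names and the four training points are assumptions made for this
illustration; the points are chosen to lie exactly on the line y = 1 + 2x so that the results are
easy to check by hand.

```python
def hypothesis(w0, w1, x):
    """Single-feature linear hypothesis h_w(x) = w0 + w1 * x."""
    return w0 + w1 * x


def cost(w0, w1, xs, ys):
    """Sum of the squared errors over the m training samples, divided by 2m."""
    m = len(xs)
    return sum((hypothesis(w0, w1, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


# Made-up training data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

print(cost(1.0, 2.0, xs, ys))  # perfect fit, so the cost is 0.0
print(cost(0.0, 2.0, xs, ys))  # every prediction is off by 1, so the cost is 4 / 8 = 0.5
```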
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Gradient descent 


When we graph parameter values against the cost function, we get a bowl-shaped convex function. As
parameter values diverge from their optimized values in either direction (from a single minimum), the
cost of our model grows. As the hypothesis function is linear, the cost function is convex. If this was
not the case, then gradient descent would be unable to distinguish between a global and a local minimum.


The gradient descent algorithm is expressed by the following update rule:


$$\text{repeat until convergence: } w_j := w_j - \alpha \frac{\partial}{\partial w_j} J_w$$


Here, $\frac{\partial}{\partial w_j} J_w$ is the first derivative of $J_w$ with respect to $w_j$; the algorithm
uses the sign of this derivative to determine which way to step. This is simply the sign of the slope of
the tangent at each point. The algorithm takes a hyper parameter, $\alpha$, which is the learning rate that
we need to set. It is called a hyper parameter to distinguish it from the w parameters that are estimated
by our model. If we set the learning rate too small, it will take longer to find the minimum; if set too
high, it will overshoot. We may find that we need to run the model several times to determine the best
learning rate.


[Figure: the cost J plotted against w_j, showing the tangent at the old value of w_j and the step to the new value of w_j]


When we apply gradient descent to linear regression, we can derive the following update formulas for
the parameters of our model. We can rewrite the derivative term to make it easier to calculate. The
derivations themselves are quite complex, and it is unnecessary to work through them here. If you
know calculus, you will be able to see that the following rules are equivalent. Here, we repeatedly
apply two update rules to the hypothesis, employing a stopping function. This is usually when the
differences between the parameters on subsequent iterations drop below a threshold, that is, t.


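Initialize $w_0$ and $w_1$ and repeat until $|w^{old} - w^{new}| < t$:


$$w_0 := w_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_w(x^{(i)}) - y^{(i)}\right)$$


$$w_1 := w_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_w(x^{(i)}) - y^{(i)}\right) x^{(i)}$$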


It is important that these update rules are applied simultaneously, that is, they are both applied in the
same iteration using the current values of $w_0$ and $w_1$, and the new values of both are then plugged
back in on the next iteration. This is sometimes called batch gradient descent because each update uses
all the training samples in one batch.
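

The simultaneous update can be sketched as follows for the single-feature case. This is a minimal
illustration rather than a tuned implementation: the function name, the learning rate, the
threshold t, the iteration cap, and the four training points are all assumptions chosen for the
example.

```python
def batch_gradient_descent(xs, ys, alpha=0.05, t=1e-9, max_iterations=100000):
    """Fit h_w(x) = w0 + w1 * x by repeating both update rules until the change is below t."""
    m = len(xs)
    w0, w1 = 0.0, 0.0
    for _ in range(max_iterations):
        errors = [w0 + w1 * x - y for x, y in zip(xs, ys)]
        # Both new values are computed from the old w0 and w1 ...
        new_w0 = w0 - alpha * sum(errors) / m
        new_w1 = w1 - alpha * sum(e * x for e, x in zip(errors, xs)) / m
        converged = abs(new_w0 - w0) < t and abs(new_w1 - w1) < t
        # ... and only then are both parameters overwritten together.
        w0, w1 = new_w0, new_w1
        if converged:
            break
    return w0, w1


# Made-up training data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w0, w1 = batch_gradient_descent(xs, ys)
print(w0, w1)
```

Updating w0 first and then using the new w0 to compute w1 would be a different algorithm; keeping
the two temporary values avoids that mistake.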


It is fairly straightforward to apply these update rules on linear regression problems that have
multiple features. This is true if we do not worry about the precise derivations.


For multiple features, our hypothesis function will look like this: 


$$h_w(x) = w^T x = w_0 x_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$$


Here, $x_0 = 1$, often called our bias feature, is added to help us with the following calculations. We
can see that, by using vectors, we can also write this as simply the transpose of the parameter
vector multiplied by the feature vector, x. With gradient descent over multiple features, our cost
function applies to a vector of parameter values, rather than just a single parameter. This is the
new cost function:


$$J(w) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_w(x^{(i)}) - y^{(i)}\right)^2$$


J(w) is simply $J(w_0, w_1, \ldots, w_n)$, where n is the number of features. J is a function of the
parameter vector, w. Now, our gradient descent update rule is as follows:


$$\text{update } w_j \text{ for } j = 0, \ldots, n: \quad w_j := w_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_w(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$


Notice that we now have multiple features. Therefore, we write the x value with the subscript j to
indicate the jth feature. We can break this apart and see that it really represents n + 1 separate
update rules, one for each parameter. Each one is identical, apart from its subscripts, to the training
rule that we used for single features.


An important point to mention here, and one that we will revisit in later chapters, is that, to make our
models work more efficiently, we can define our own features. For a simple situation, where our
hypothesis is to estimate the price of a block of land based on two features, width and depth,
obviously, we can multiply these two features to get one feature, that is, area. So, depending on a
particular insight that you might have about a problem, it can make more sense to use derived
features. We can take this idea further and create our own features to enable our model to fit nonlinear
data. A technique to do this is polynomial regression. This involves adding power terms to our
hypothesis function, making it a polynomial. Here is an example:


$$h_w(x) = w_0 + w_1 x + w_2 x^2 + w_3 x^3$$


A way to apply this, in the case of our land price example, is to simply add the square and the cube of
our area feature. There are many possible choices for these terms, and in fact, a better choice in our
land price example might be to take the square root of one of the terms to stop the function exploding
to infinity. This highlights an important point, that is, when using polynomial regression, we must be
very careful about feature scaling. We can see that the terms in the function get increasingly larger as x
gets larger.
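

The scaling problem is easy to see numerically. The sketch below builds the power terms for a
single feature and then standardizes each column. The helper names and the four land areas are
assumptions invented for this example.

```python
import numpy as np


def polynomial_features(x, degree=3):
    """Stack the columns x, x**2, ..., x**degree for a single input feature."""
    return np.column_stack([x ** p for p in range(1, degree + 1)])


def standardize(features):
    """Rescale every column to zero mean and unit variance."""
    return (features - features.mean(axis=0)) / features.std(axis=0)


# Hypothetical land areas; the values are invented for this example.
area = np.array([100.0, 200.0, 400.0, 800.0])
raw = polynomial_features(area)
scaled = standardize(raw)

print(raw.max(axis=0))     # the cubed column is several orders of magnitude larger
print(scaled.std(axis=0))  # after scaling, every column has unit variance
```

Without the scaling step, the cubed term would dominate the gradient and force a very small
learning rate on all the other parameters.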


We now have a model to fit nonlinear data; however, at this stage, we are just manually trying
different polynomials. Ideally, we need to be able to incorporate feature selection, to some extent, in
our models, rather than have a human try to figure out an appropriate function. We also need to be
aware that correlated features may make our models unstable, so we need to devise ways of
decomposing correlated features into their components. We look at these aspects in Chapter 7, 
Features — How Algorithms See the World. 


The following is a simple implementation of batch gradient descent. Try running it with different 
values of the learning rate alpha, and on data with a greater bias and/or variance, and also after 
changing the number of iterations to see what effect this has on the performance of our model: 


import numpy as np
import random
import matplotlib.pyplot as plt


def gradientDescent(x, y, alpha, numIterations):
    xTrans = x.transpose()
    m, n = np.shape(x)
    theta = np.ones(n)
    for i in range(0, numIterations):
        hwx = np.dot(x, theta)              # predictions for all m samples
        loss = hwx - y                      # error of each prediction
        cost = np.sum(loss ** 2) / (2 * m)  # halved mean squared error
        print("Iteration %d | Cost: %f" % (i, cost))
        gradient = np.dot(xTrans, loss) / m
        theta = theta - alpha * gradient    # update all parameters at once
    return theta


def genData(numPoints, bias, variance):
    x = np.zeros(shape=(numPoints, 2))
    y = np.zeros(shape=numPoints)
    for i in range(0, numPoints):
        x[i][0] = 1  # bias feature
        x[i][1] = i
        y[i] = (i + bias) + random.uniform(0, 1) * variance
    return x, y


def plotData(x, y, theta):
    plt.scatter(x[..., 1], y)
    plt.plot(x[..., 1], [theta[0] + theta[1] * xi for xi in x[..., 1]])
    plt.show()


x, y = genData(20, 25, 10)
iterations = 10000
alpha = 0.001
theta = gradientDescent(x, y, alpha, iterations)
plotData(x, y, theta)


The output of the code is as shown in the following screenshot: 
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Iteration 9998 | Cost: 3.271415 
Iteration 9999 | Cost: 3.271138 







This is called batch gradient descent because, on each iteration, it updates the parameter values
based on all the training samples at once. With stochastic gradient descent, on the other hand, the
gradient is approximated by the gradient of a single example at a time. Several passes may be made
over the data until the algorithm converges. On each pass, the data is shuffled to prevent it from
getting stuck in a loop. Stochastic gradient descent has been successfully applied to large scale
learning problems such as natural language processing. One of the disadvantages is that it requires a
number of hyper parameters, although this does present opportunities for tweaking such as choosing a
loss function or the type of regularization applied. Stochastic gradient descent is also sensitive to
feature scaling. Many implementations of this, such as SGDClassifier and SGDRegressor from the
sklearn package, will by default use a learning rate schedule that reduces the learning rate as training
progresses, so that the steps shrink as the algorithm moves closer to the minimum. To make these
algorithms work well, it is usually necessary to scale the data so that each value in the input vector, X,
is scaled between 0 and 1 or between -1 and 1. Alternatively, ensure that the data values have a mean
of 0 and a variance of 1. This is most easily done using the StandardScaler class from
sklearn.preprocessing.
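

The following is a hand-written sketch of stochastic gradient descent, given here only to show
the two ideas from the paragraph above: one example per update, with the data reshuffled on each
pass, and features standardized to a mean of 0 and a variance of 1. It is not how the sklearn
estimators are implemented. The synthetic data, the constant learning rate, and the number of
passes are assumptions chosen for the example, and the standardization is done by hand rather
than with StandardScaler so that the block depends only on NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with two features on very different scales (values are made up).
m = 200
features = np.column_stack([rng.uniform(0, 10, m), rng.uniform(0, 1000, m)])
y = 4.0 + 2.0 * features[:, 0] + 0.05 * features[:, 1]

# Standardize to mean 0 and variance 1, then prepend the bias feature x0 = 1.
scaled = (features - features.mean(axis=0)) / features.std(axis=0)
X = np.column_stack([np.ones(m), scaled])

w = np.zeros(X.shape[1])
alpha = 0.01
for epoch in range(200):
    for i in rng.permutation(m):      # shuffle the samples on every pass
        error = X[i] @ w - y[i]       # error on a single example
        w = w - alpha * error * X[i]  # step along that example's gradient

cost = np.sum((X @ w - y) ** 2) / (2 * m)
print(w, cost)
```

Because the features are centered, the fitted bias term ends up close to the mean of y. Running
the same loop on the unscaled features with this learning rate would diverge, which is one way to
see why scaling matters here.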


Gradient descent is not the only algorithm, and in many ways, it is not the most efficient way to 
minimize the cost function. There are a number of advanced libraries that will compute values for the 
parameters much more efficiently than if we implemented the gradient descent update rules manually. 
Fortunately, we do not have to worry too much about the details because there are a number of 
sophisticated and efficient algorithms for regression already written in Python. For example, in the
sklearn.linear_model module, there are the Ridge, Lasso, and ElasticNet algorithms that may
perform better, depending on your application.
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The normal equation 


Let's now look at the linear regression problem from a slightly different angle. As I mentioned earlier,
there is a closed-form solution; thus, rather than iterate through our training set, as we do with gradient
descent, we can use what is called the normal equation to solve it in one step. If you know some
calculus, you will recall that we can minimize a function by taking its derivative and then setting the
derivative to zero to solve for a variable. This makes sense because, if we consider our convex cost
function, the minimum will be where the slope of the tangent is zero. So, in our simple case with one
feature, we differentiate J(w) with respect to w and set it to zero and solve for w. The problem we are
interested in is when w is an n + 1 parameter vector and the cost function, J(w), is a function of this
vector. One way to minimize this is to take the partial derivative of J(w) for the parameter values in
turn and then set these derivatives to zero, solving for each value of w. This gives us the values of w
that are needed to minimize the cost function.


It turns out that an easy way to solve what could be a long and complicated calculation is what is
known as the normal equation. To see how this works, we first define a feature matrix, shown as
follows:


$$X = \begin{bmatrix} x_0^{(1)} & x_1^{(1)} & x_2^{(1)} & \cdots & x_n^{(1)} \\ x_0^{(2)} & x_1^{(2)} & x_2^{(2)} & \cdots & x_n^{(2)} \\ \vdots & \vdots & \vdots & & \vdots \\ x_0^{(m)} & x_1^{(m)} & x_2^{(m)} & \cdots & x_n^{(m)} \end{bmatrix}$$


This creates an m by n + 1 matrix, where m is the number of training examples, and n is the number of
features. Notice that, in our notation, we now define our training label vector as follows:


$$y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}$$


Now, it turns out that we can compute the parameter values to minimize this cost function by the
following equation:


$$w = (X^T X)^{-1} X^T y$$


This is the normal equation. There are of course many ways to implement this in Python. Here is one
simple way using the NumPy matrix class. Most implementations will have a regularization
parameter that, among other things, prevents an error arising from attempting to invert a singular
matrix. This will occur when we have more features than training data, that is, when n is greater than
m; the normal equation without regularization will not work. This is because the matrix $X^T X$ is then
non-invertible, and so there is no way to calculate our term, $(X^T X)^{-1}$. Regularization has other
benefits, as we will see shortly:


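import numpy as np


def normDemo(la=.9):
    # Two training samples; the first column is the bias feature
    X = np.matrix('1 2 5 ; 1 4 6')
    y = np.matrix('8; 16')
    xtrans = X.T
    idx = np.matrix(np.identity(X.shape[1]))
    # Invert (X^T X + la * I); the la * I term is the regularization
    xti = (xtrans.dot(X) + la * idx).I
    xtidt = xti.dot(xtrans)
    return (xtidt.dot(y))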


One of the advantages of using the normal equation is that you do not need to worry about feature
scaling. Features that have different ranges (for example, if one feature has values between 1 and 10,
and another feature has values between zero and 1000) will likely cause problems for gradient
descent. Using the normal equation, you do not need to worry about this. Another advantage of the
normal equation is that you do not need to choose the learning rate. We saw that, with gradient
descent, an incorrectly chosen learning rate could either make the model unnecessarily slow or, if the
learning rate is too large, it can cause the model to overshoot the minimum. This may entail an extra
step in our testing phase for gradient descent.


The normal equation has its own particular disadvantages; foremost is that it does not scale as well
when we have data with a large number of features. We need to calculate the inverse of $X^T X$, the
product of the transpose of our feature matrix, X, with X itself. This product is an n by n matrix
(n + 1 by n + 1 once the bias feature is included). Remember that n is the number of features. On most
platforms, the time it takes to invert a matrix grows approximately as the cube of n. So, for data with a
large number of features, say greater than 10,000, you should probably consider using gradient descent
rather than the normal equation. Another problem, as noted earlier, is that when we have more features
than training data, that is, when n is greater than m, the normal equation without regularization will
not work. This is because the matrix $X^T X$ is non-invertible, and so there is no way to calculate our
term, $(X^T X)^{-1}$.
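

Both situations can be checked with a short NumPy sketch. It uses plain arrays and
np.linalg.solve instead of the matrix class and an explicit inverse, which is a swapped-in
technique rather than the approach used in normDemo. The five-sample data set is made up for the
example; the two-sample data set repeats the values from normDemo, and the regularization value
of 0.9 is that function's default.

```python
import numpy as np

# Well-posed case: m = 5 samples, one feature plus the bias column (made-up data).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Solve (X^T X) w = X^T y rather than forming the inverse explicitly.
w = np.linalg.solve(X.T @ X, X.T @ y)
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
print(w, w_lstsq)

# Ill-posed case: m = 2 samples but three columns, so X^T X (3 x 3) has rank 2.
X_wide = np.array([[1.0, 2.0, 5.0], [1.0, 4.0, 6.0]])
y_wide = np.array([8.0, 16.0])
gram = X_wide.T @ X_wide
print(np.linalg.matrix_rank(gram))

# Adding lambda * I makes the matrix invertible again.
lam = 0.9
w_reg = np.linalg.solve(gram + lam * np.eye(3), X_wide.T @ y_wide)
print(w_reg)
```

The first solution agrees with NumPy's own least squares routine, and the rank check shows why
the unregularized equation cannot be solved when there are more columns than samples.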
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Logistic regression 


With our least squares model, we solved a minimization problem to fit a regression line. We can also
use a variation of this idea to solve classification problems. Consider what happens when we apply
linear regression to a classification problem. Let's take the simple case of binary classification with
one feature. We can plot our feature on the x axis against the class labels on the y axis. Our feature
variable is continuous, but our target variable on the y axis is discrete. For binary classification, we
usually use 0 to represent the negative class and 1 to represent the positive class. We construct a
regression line through the data and use a threshold on the y axis to estimate the decision boundary.
Here we use a threshold of 0.5.


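[Figure: two panels plotting the class label against the feature, each with a fitted regression line; the right-hand panel includes an outlier]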





In the figure on the left-hand side, where the variance is small and our positive and negative cases are
well separated, we get an acceptable result. The algorithm correctly classifies the training set. In the
image on the right-hand side, we have a single outlier in the data. This makes our regression line
flatter and shifts our cutoff to the right. The outlier, which clearly belongs in class 1, should not make
any difference to the model's prediction; however, now with the same cutoff point, the prediction
misclassifies the first instance of class 1 as class 0.


One way that we approach the problem is to formulate a different hypothesis representation. For
logistic regression, we are going to use the linear function as an input to another function, g.


$$h_w(x) = g(w^T x), \quad \text{where } 0 < h_w(x) < 1$$


The term g is called the sigmoid or logistic function:


$$g(z) = \frac{1}{1 + e^{-z}}$$


You will notice from its graph that, on the y axis, it has asymptotes at zero and one, and it crosses the
y axis at 0.5.





Now, if we replace the z with $w^T x$, we can rewrite our hypothesis function like this:


$$h_w(x) = \frac{1}{1 + e^{-w^T x}}$$
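

A minimal sketch of the sigmoid and of the resulting hypothesis function follows. The function
names are chosen for this illustration, and the parameter and feature values in the last line are
arbitrary, with the bias feature $x_0 = 1$ written out explicitly.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def hypothesis(w, x):
    """h_w(x) = g(w^T x), where x already includes the bias feature x0 = 1."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))


print(sigmoid(0))                          # 0.5, where the curve crosses the y axis
print(sigmoid(-10), sigmoid(10))           # close to the asymptotes at 0 and 1
print(hypothesis([-3, 1, 1], [1, 2, 2]))   # w^T x = 1, so the output is above 0.5
```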


As with linear regression, we need to fit the parameters, w, to our training data to give us a function 
that can make predictions. Before we try and fit the model, let's look at how we can interpret the 
output from our hypothesis function. Since this will return a number between zero and one, the most 
natural way to interpret this is as it being the probability of the positive class. Since we know, or 
assume, that each sample can only belong in one of two classes, then the probability of the positive 
class plus the probability of the negative class must be equal to one. Therefore, if we can estimate the
probability of the positive class, then we can estimate the probability of the negative class. Since we are ultimately
trying to predict the class of a particular sample, we can interpret the output of the hypothesis function 
as positive if it returns a value greater than or equal to 0.5, or negative otherwise. Now, given the 
characteristics of the sigmoid function, we can write the following: 


$$h_w(x) = g(w^T x) \geq 0.5 \text{ whenever } w^T x \geq 0$$


Whenever $w^T x$, evaluated on a particular sample, is greater than or equal to zero, we can predict the
positive class. Let's look at a simple example. We have not yet fitted our parameters to this model, and
we will do so shortly, but for the sake of this example, let's assume that we have a parameter vector as
follows:


$$w = \begin{bmatrix} -3 \\ 1 \\ 1 \end{bmatrix}$$


Our hypothesis function, therefore, looks like this:


$$h_w(x) = g(-3 + x_1 + x_2)$$


We can predict y = 1 if the following condition is met:


$$-3 + x_1 + x_2 \geq 0$$


Equivalently:


$$x_1 + x_2 \geq 3$$


This can be sketched with the following graph:


This is simply a straight line that crosses the $x_1$ axis at 3 and the $x_2$ axis at 3, and it represents
the decision boundary. It creates two regions where we predict either y = 0 or y = 1. What happens
when the decision boundary is not a straight line? In the same way that we added polynomials to the
hypothesis function in linear regression, we can also do this with logistic regression. Let's write a new
hypothesis function with some higher order terms to see how we can fit it to the data:


$$h_w(x) = g(w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2)$$


Here we have added two squared terms to our function. We will see how to fit the parameters shortly, 
but for now, let's set our parameter vector to the following: 


$$w = \begin{bmatrix} -1 \\ 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}$$


So, we can now write the following: 


$$\text{Predict } y = 1 \text{ if } -1 + x_1^2 + x_2^2 \geq 0$$


Or alternatively, we can write this:


$$\text{Predict } y = 1 \text{ if } x_1^2 + x_2^2 \geq 1$$


This, you may recognize, is the equation for a circle of radius 1 centered on the origin, and we can use
this as our decision boundary. We can create more complex decision boundaries by adding higher order
polynomial terms.
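

The two decision boundaries discussed in this section can be written directly as prediction rules.
This sketch hard-codes the assumed parameter vectors from the examples, (-3, 1, 1) for the straight
line and (-1, 0, 0, 1, 1) for the circle; the function names and the test points are made up for
the illustration.

```python
def predict_linear(x1, x2):
    """Predict y = 1 when -3 + x1 + x2 >= 0, that is, w = (-3, 1, 1)."""
    return 1 if -3 + x1 + x2 >= 0 else 0


def predict_circle(x1, x2):
    """Predict y = 1 when -1 + x1^2 + x2^2 >= 0, that is, w = (-1, 0, 0, 1, 1)."""
    return 1 if -1 + x1 ** 2 + x2 ** 2 >= 0 else 0


print(predict_linear(1, 1), predict_linear(2, 2))      # below and above the line
print(predict_circle(0.5, 0.5), predict_circle(1, 1))  # inside and outside the circle
```

Because the sigmoid returns at least 0.5 exactly when its argument is non-negative, the prediction
only needs the sign of the linear (or polynomial) term and never has to evaluate g itself.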
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The Cost function for logistic regression 


Now, we need to look at the important task of fitting the parameters to the data. If we rewrite the cost 
function we used for linear regression more simply, we can see that the cost is one half of the squared 
error: 


$$\text{Cost}(h_w(x), y) = \frac{1}{2} \left(h_w(x) - y\right)^2$$


The interpretation is that it is simply calculating the cost we want the model to incur, given a certain
prediction, that is, $h_w(x)$, and a training label, y.


This will work to a certain extent with logistic regression; however, there is a problem. With logistic
regression, our hypothesis function is dependent on the nonlinear sigmoid function, and when we plot
the resulting cost against our parameters, it will usually produce a function that is not convex. This
means that, when we try to apply an algorithm such as gradient descent to the cost function, it will not
necessarily converge to the global minimum. A solution is to define a cost function that is convex, and
it turns out that the following two functions, one for each class, are suitable for our purposes:


$$\text{Cost}(h_w(x)) = -\log(h_w(x)) \quad \text{if } y = 1$$

$$\text{Cost}(h_w(x)) = -\log(1 - h_w(x)) \quad \text{if } y = 0$$


This gives us the following graphs: 





Intuitively, we can see that this does what we need it to do. If we consider a single training sample in
the positive class, that is y = 1, and if our hypothesis function, $h_w(x)$, correctly predicts 1, then the
cost, as you would expect, is 0. If the output of the hypothesis function is 0, it is incorrect, so the cost
approaches infinity. When y is in the negative class, our cost function is the graph on the right. Here
the cost is zero when $h_w(x)$ is 0 and rises to infinity when $h_w(x)$ is 1. We can write this in a more
compact way, remembering that y is either 0 or 1:


$$\text{Cost}(h_w(x), y) = -y \log(h_w(x)) - (1 - y) \log(1 - h_w(x))$$


We can see that, for each of the possibilities, y = 1 or y = 0, the irrelevant term is multiplied by 0,
leaving the correct term for each particular case. So, now we can write our cost function as follows:


$$J(w) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_w(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_w(x^{(i)})\right) \right]$$


So, if we are given a new, unlabeled value of x, how do we make a prediction? As with linear
regression, our aim is to minimize the cost function, J(w). We can use the same update rule that we
used for linear regression, that is, using the partial derivative to find the slope, and when we rewrite
the derivative, we get the following:


$$\text{Repeat until convergence: } w_j := w_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_w(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$$
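

Putting the cost function and the update rule together gives the following sketch of logistic
regression trained by batch gradient descent, with the gradient averaged over the m samples. The
function names, the learning rate, the number of iterations, and the eight training points are
assumptions made for this illustration; the data is deliberately not perfectly separable so that
the parameters settle at finite values.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def logistic_cost(w, X, y):
    """J(w) = -(1/m) * sum(y * log(h) + (1 - y) * log(1 - h))."""
    h = sigmoid(X @ w)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))


def fit_logistic(X, y, alpha=0.1, iterations=5000):
    """Batch gradient descent; every w_j is updated from the same error vector."""
    w = np.zeros(X.shape[1])
    for _ in range(iterations):
        gradient = X.T @ (sigmoid(X @ w) - y) / len(y)
        w = w - alpha * gradient
    return w


# Made-up, one-feature data with overlapping classes; the first column is the bias.
X = np.array([[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0],
              [1.0, 2.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

w = fit_logistic(X, y)
predictions = (sigmoid(X @ w) >= 0.5).astype(int)
print(w, logistic_cost(w, X, y), predictions)
```

Apart from the sigmoid inside the hypothesis, the update line is the same as the one used for
linear regression, which is the point made in the text above.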
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Multiclass classification 


So far, we have just looked at binary classification. For multiclass classification, we assume that
each instance belongs to only one class. A slightly different classification problem is where each
sample can belong to more than one target class. This is called multi-label classification. We can
employ similar strategies on each of these types of problem.


There are two basic approaches: 


• One versus all
• One versus one


In the one versus all approach, a single multiclass problem is transformed into a number of binary 
classification problems. This is called the one versus all technique because we take each class in 
turn and fit a hypothesis function for that particular class, assigning a negative class to the other 
classes. We end up with different classifiers, each of which is trained to recognize one of the classes.
We make a prediction given a new input by running all the classifiers and picking the class whose
classifier predicts the highest probability. To formalize it, we write the following:


$$h_w^{(i)}(x): \text{ for each class } i, \text{ predict the probability that } y = i$$


To make a prediction, we pick the class that maximizes the following:


$$h_w^{(i)}(x)$$


With another approach called the one versus one method, a classifier is constructed for each pair of
classes. When the model makes a prediction, the class that receives the most votes wins. This method
is generally slower than the one versus all method, especially when there are a large number of
classes.
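

The difference between the two prediction schemes can be sketched without training any models.
In the following illustration, the class names, the per-class probabilities, and the pairwise
winners are all hypothetical stand-ins for the outputs of already trained binary classifiers.

```python
from collections import Counter
from itertools import combinations


def predict_one_vs_all(class_probabilities):
    """Pick the class whose own binary classifier reports the highest probability."""
    return max(class_probabilities, key=class_probabilities.get)


def predict_one_vs_one(pairwise_winners):
    """Pick the class that wins the most pairwise contests."""
    return Counter(pairwise_winners.values()).most_common(1)[0][0]


classes = ["setosa", "versicolor", "virginica"]

# Hypothetical outputs of three one versus all classifiers for one input.
probabilities = {"setosa": 0.1, "versicolor": 0.7, "virginica": 0.4}
print(predict_one_vs_all(probabilities))

# One versus one needs a classifier for every pair: k * (k - 1) / 2 of them.
pairs = list(combinations(classes, 2))
# Hypothetical winner of each pairwise classifier for the same input.
winners = {pairs[0]: "versicolor", pairs[1]: "virginica", pairs[2]: "versicolor"}
print(len(pairs), predict_one_vs_one(winners))
```

With k classes, one versus all trains k classifiers while one versus one trains k(k - 1)/2, which
is why the pairwise approach becomes expensive as the number of classes grows.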


All sklearn classifiers implement multiclass classification. We saw this in Chapter 2, Tools and
Techniques, with the K-nearest neighbors example, where we attempted to predict one of three
classes using the iris dataset. Sklearn implements the one versus all algorithm using the
OneVsRestClassifier class and the one versus one algorithm with OneVsOneClassifier. These
are called meta-estimators because they take another estimator as an input. They have the advantage
of being able to permit changing the way more than two classes are handled, and this can result in
better performance, either in terms of computational efficiency, or generalization error.
In the following example, we use LinearSVC as the base estimator:


from sklearn import datasets
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC


# make_classification generates two classes unless n_classes is set
X, y = datasets.make_classification(n_samples=10000, n_features=5)
X1, y1 = datasets.make_classification(n_samples=10000, n_features=5)
clsAll = OneVsRestClassifier(LinearSVC(random_state=0)).fit(X, y)
clsOne = OneVsOneClassifier(LinearSVC(random_state=0)).fit(X1, y1)
# score() returns the mean accuracy on the given data
print("One vs all cost= %f" % clsAll.score(X, y))
print("One vs one cost= %f" % clsOne.score(X1, y1))


We will observe output similar to the following (the exact values vary because the datasets are
randomly generated):

One vs all cost= 0.947400
One vs one cost= 0.949700
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Regularization 


We mentioned earlier that linear regression can become unstable, that is, highly sensitive to small
changes in the training data, if features are correlated. Consider the extreme case where two features
are perfectly negatively correlated such that any increase in one feature is accompanied by an
equivalent decrease in another feature. When we apply our linear regression algorithm to just these
two features, it will result in a function that is constant, so this is not really telling us anything about
the data. Alternatively, if the features are positively correlated, small changes in them will be
amplified. Regularization helps moderate this.


We saw previously that we could get our hypothesis to more closely fit the training data by adding
polynomial terms. As we add these terms, the shape of the function becomes more complicated, and
this usually results in the hypothesis overfitting the training data and performing poorly on the test
data. As we add features, either directly from the data or the ones we derive ourselves, it becomes
more likely that the model will overfit the data. One approach is to discard features that we think are
less important. However, we cannot know for certain, in advance, what features may contain relevant
information. A better approach is to not discard features but rather to shrink their parameters. Since we
do not know how much information each feature contains, regularization reduces the magnitude of all the
parameters.


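We can simply add a regularization term to the cost function:

$$J_w = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} w_j^2 \right]$$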


The hyper parameter, lambda, controls a tradeoff between two goals—the need to fit the training data, 
and the need to keep the parameters small to avoid overfitting. We do not apply the regularization 
parameter to our bias feature, so we separate the update rule for the first feature and add a 
regularization parameter to all subsequent features. We can write it like this: 


$$
\begin{aligned}
&\text{Repeat until convergence } \{ \\
&\quad w_0 := w_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right) x_0^{(i)} \\
&\quad w_j := w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} w_j \right] \\
&\}
\end{aligned}
$$

Here, we have added our regularization term, $\frac{\lambda}{m} w_j$. To see more clearly how this works, we can
group all the terms that depend on $w_j$, and our update rule can be rewritten as follows:

$$w_j := w_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

The regularization parameter, $\lambda$, is usually a small number greater than zero. In order for it to have the
desired effect, it is set such that $1 - \alpha \frac{\lambda}{m}$ is a number slightly less than 1. This will shrink $w_j$ on each
iteration of the update.
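To make the shrinking effect concrete, here is a minimal sketch of the grouped update for a single weight. The function name ridge_update and the values of alpha, lam, and m are assumptions made for illustration; grad_j stands for the data term $\frac{1}{m} \sum_i (h_w(x^{(i)}) - y^{(i)}) x_j^{(i)}$, which is set to zero here so that only the shrinkage is visible:

```python
def ridge_update(w_j, grad_j, alpha, lam, m):
    # w_j := w_j * (1 - alpha * lam / m) - alpha * grad_j
    shrink = 1.0 - alpha * lam / m
    return w_j * shrink - alpha * grad_j


alpha, lam, m = 0.1, 1.0, 100  # hypothetical learning rate, lambda, sample count
shrink = 1.0 - alpha * lam / m  # slightly less than 1

w = 2.0
for _ in range(3):
    # with a zero data gradient, each step only multiplies w by the shrink factor
    w = ridge_update(w, grad_j=0.0, alpha=alpha, lam=lam, m=m)

print(shrink, w)
```

With these values the weight is multiplied by the same factor on every iteration, so it decays gradually toward zero unless the data term pushes it back.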


Now, let's see how we can apply regularization to the normal equation. The equation is as follows:

$$w = \left( X^T X + \lambda I \right)^{-1} X^T y$$

This is sometimes referred to as the closed form solution. We add the identity matrix, I, multiplied by
the regularization parameter. The identity matrix is an (n+1) by (n+1) matrix consisting of ones on the
main diagonal and zeros everywhere else.


In some implementations, we might also make the first entry, the top-left corner, of the matrix zero to
reflect the fact that we are not applying a regularization parameter to the first bias feature. However,
in practice, this will rarely make much difference to our model.
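The closed form solution can be sketched with NumPy as follows. The data is random and the value of lam is an arbitrary assumption; the example deliberately uses more features than samples, and for brevity the feature matrix has no explicit bias column:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))  # 5 samples, 8 features: more features than samples
y = rng.normal(size=5)
lam = 0.1  # hypothetical regularization strength

gram = X.T @ X  # 8 x 8, but its rank is at most 5, so it cannot be inverted
reg = lam * np.eye(X.shape[1])
# reg[0, 0] = 0.0  # optional variant: leave the first (bias) entry unregularized

# Solve (X^T X + lambda I) w = X^T y rather than forming the inverse explicitly
w = np.linalg.solve(gram + reg, X.T @ y)
print(np.linalg.matrix_rank(gram), w.shape)
```

Using np.linalg.solve instead of computing the inverse is a common numerical choice; it yields the same w as the formula.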


When we multiply $\lambda$ by the identity matrix, we get a matrix where the main diagonal contains the
value of $\lambda$, with all other positions as zero. This makes sure that, even if we have more features than
training samples, we will still be able to invert the matrix $X^T X + \lambda I$. It also makes our model more stable
if we have correlated variables. This form of regression is sometimes called ridge regression, and
we saw an implementation of this in Chapter 2, Tools and Techniques. An interesting alternative to
ridge regression is lasso regression. It replaces the ridge regression regularization term, $\sum_j w_j^2$, with
$\sum_j |w_j|$. That is, instead of using the sum of the squares of the weights, it uses the sum of the absolute
values of the weights. The result is that some of the weights are set to 0 and others are shrunk. Lasso
regression tends to be quite sensitive to the regularization parameter. Unlike ridge regression, lasso
regression does not have a closed-form solution, so other forms of numerical optimization need to be
employed. Ridge regression is sometimes referred to as using the L2 norm, and lasso regularization,
the L1 norm.


Finally, we will look at how to apply regularization to logistic regression. As with linear regression,
logistic regression can suffer from overfitting if our hypothesis functions
contain higher-order terms or many features. We can modify our logistic regression cost function to
add the regularization parameter, as follows:

$$J_w = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_w(x^{(i)}) + \left( 1 - y^{(i)} \right) \log \left( 1 - h_w(x^{(i)}) \right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$$


To implement gradient descent for logistic regression, we end up with an equation that, on the surface, 
looks identical to the one we used for gradient descent for linear regression. However, we must 
remember that our hypothesis function is the one we used for logistic regression. 


$$w_j := w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} w_j \right]$$

Here, the hypothesis function is the following:

$$h_w(x) = \frac{1}{1 + e^{-w^T x}}$$
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Summary 


In this chapter, we studied some of the most used techniques in machine learning. We created
hypothesis representations for linear and logistic regression. You learned how to create a cost 
function to measure the performance of the hypothesis on training data, and how to minimize the cost 
function in order to fit the parameters, using both gradient descent and the normal equation. We 
showed how you could fit the hypothesis function to nonlinear data by using polynomial terms in the 
hypothesis function. Finally, we looked at regularization, its uses, and how to apply it to logistic and 
linear regression. 


These are powerful techniques used widely in many different machine learning algorithms. However,
as you have probably realized, there is a lot more to the story. The models we have looked at so far
usually require considerable human intervention to get them to perform usefully. For example, we
have to set the hyper parameters, such as the learning rate or regularization parameter, and, in the case
of nonlinear data, we have to try and find polynomial terms that will force our hypothesis to fit the
data. It will be difficult to determine exactly what these terms are, especially when we have many
features. In the next chapter, we will look at the ideas that drive some of the most powerful learning
algorithms on the planet, that is, neural networks.
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Chapter 6. Neural Networks 


Artificial neural networks, as the name suggests, are algorithms that attempt to mimic the way
neurons work in the brain. Conceptual work began in the 1940s, but it is only somewhat recently that
a number of important insights, together with the availability of hardware to run these more
computationally expensive models, have given neural networks practical application. They are now
state-of-the-art techniques that are at the heart of many advanced machine learning applications.


In this chapter, we will introduce the following topics: 
Logistic units
The cost function for neural networks
Implementing a neural network
Other neural network architectures
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Getting started with neural networks 


We saw in the last chapter how we could create a nonlinear decision boundary by adding polynomial
terms to our hypothesis function. We can also use this technique in linear regression to fit nonlinear
data. However, this is not the ideal solution for a number of reasons. Firstly, we have to choose
polynomial terms, and for complicated decision boundaries, this can be an imprecise and
time-intensive process, which can take quite a bit of trial and error. We also need to consider what
happens when we have a large number of features. It becomes difficult to understand exactly how
added polynomial terms will change the decision boundary. It also means that the possible number of
derived features will grow exponentially. To fit complicated boundaries, we will need many
higher-order terms, and our model will become unwieldy, computationally expensive, and hard to
understand.


Consider applications such as computer vision, where in a grayscale image, each pixel is a feature
that has a value between 0 and 255. For a small image, say 100 pixels by 100 pixels, we have 10,000 
features. If we include just quadratic terms, we end up with around 50 million possible features, and 
to fit complex decision boundaries, we likely need cubic and higher order terms. Clearly, such a 
model is entirely unworkable. 
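The figure of around 50 million can be checked with a short calculation. This sketch assumes that "quadratic terms" means every product of two features, including each feature multiplied by itself:

```python
n = 100 * 100  # one feature per pixel in a 100 x 100 image

# every product x_i * x_j with i <= j, squares included
quadratic_terms = n * (n + 1) // 2

print(n, quadratic_terms)
```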


When we approach the problem of trying to mimic the brain, we are faced with a number of 
difficulties. Considering all the different things that the brain does, we might first think that the brain 
consists of a number of different algorithms, each specialized to do a particular task, and each
hardwired into different parts of the brain. This approach basically considers the brain as a number of
subsystems, each with its own program and task. For example, the auditory cortex for perceiving 
sound has its own algorithm that, for example, does a Fourier transform on the incoming sound wave 
to detect pitch. The visual cortex, on the other hand, has its own distinct algorithm for decoding and 
converting the signals from the optic nerve into the sense of sight. There is, however, growing 
evidence that the brain does not function like this at all. 


Recent experiments on animals have shown the remarkable adaptability of brain tissue. By rewiring
the optic nerve to the auditory cortex in animals, scientists found that the brain could learn to see
using the machinery of the auditory cortex. The animals tested as having full vision despite the fact
that their visual cortex had been bypassed. It appears that brain tissue, in different parts of the brain,
can relearn how to interpret its inputs. So, rather than the brain consisting of specialized subsystems
programmed to perform specific tasks, it uses the same algorithm to learn different tasks. This single
algorithm approach has many advantages, not least of which is that it is relatively easy to implement.
It also means that we can create generalized models and then train them to perform specialized tasks.
As in real brains, using a single algorithm to describe how each neuron communicates with the other
neurons around it allows artificial neural networks to be adaptable and able to carry out multiple
higher-level tasks. But what is the nature of this single algorithm?


When trying to mimic real brain functions, we are forced to greatly simplify many things. For
example, there is no way to take into account the role of the chemical state of the brain, or the state of
the brain at different stages of development and growth. Most of the neural net models currently in use
employ discrete layers of artificial neurons, or units, connected in a well-ordered linear sequence or
in layers. The brain, on the other hand, consists of many complex, nested, and interconnected neural
circuits. Some progress has been made in attempting to imitate these complex feedback systems, and
we will look at these at the end of this chapter. However, there is still much that we do not know
about real brain action and how to incorporate this complex behavior into artificial neural networks.
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Logistic units 


As a starting point, we use the idea of a logistic unit as a simplified model of a neuron. It consists
of a set of inputs and outputs and an activation function. This activation function is essentially
performing a calculation on the set of inputs, and subsequently giving an output. Here, we set the
activation function to the sigmoid that we used for logistic regression in the previous chapter:





We have two input units, $x_1$ and $x_2$, and a bias unit, $x_0$, that is set to one. These are fed into a
hypothesis function that uses the sigmoid logistic function and a weight vector, w, which
parameterizes the hypothesis function. The feature vector, consisting of binary values, and the
parameter vector for the preceding example consist of the following:

$$x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix}, \qquad w = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}$$

To see how we can get this to perform logical functions, let's give the model some weights. We can
write this as a function of the sigmoid, g, and our weights. To get started, we are just going to choose
some weights. We will learn shortly how to train the model to learn its own weights. Let's say that we
set our weights such that we have the following hypothesis function:

$$h_w(x) = g(-15 + 10x_1 + 10x_2)$$


We feed our model some simple labeled data and construct a truth table: 


x1   x2   y    h_w(x)
0    0    1    g(-15) ≈ 0
0    1    0    g(-5) ≈ 0
1    0    0    g(-5) ≈ 0
1    1    1    g(5) ≈ 1

Although this data appears relatively simple, the decision boundary that is needed to separate the
classes is not. Our target variable, y, forms the logical XNOR with the input variables. The output is
1 only when $x_1$ and $x_2$ are both 0 or both 1.


Here, our hypothesis has given us a logical AND. That is, it returns a 1 only when both $x_1$ and $x_2$ are 1. By
setting the weights to other values, we can get our single artificial neuron to form other logical
functions.


The following weights give us the logical OR function:

$$h_w(x) = g(-5 + 10x_1 + 10x_2)$$


To perform an XNOR, we combine the AND, OR, and NOT functions. To perform negation, that is, a 
logical NOT, we simply choose large negative weights for the input variable that we want to negate. 
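The logical functions described above can be checked with a few lines of Python. The AND and OR weights are the ones given in the text; the NOT weights (a bias of 5 and an input weight of -10) are an assumption chosen only to illustrate a large negative weight:

```python
import math


def g(z):
    # sigmoid activation
    return 1.0 / (1.0 + math.exp(-z))


def unit(w0, w1, w2):
    # a logistic unit with bias weight w0 and input weights w1 and w2
    return lambda x1, x2: g(w0 + w1 * x1 + w2 * x2)


logical_and = unit(-15, 10, 10)
logical_or = unit(-5, 10, 10)


def logical_not(x1):
    # hypothetical weights: a large negative weight on the input being negated
    return g(5 - 10 * x1)


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(logical_and(x1, x2)), round(logical_or(x1, x2)))
print(round(logical_not(0)), round(logical_not(1)))
```

Rounding the sigmoid output to the nearest integer turns the unit's activation into a 0 or 1 decision.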


Logistic units are connected together to form artificial neural networks. These networks consist of an
input layer, one or more hidden layers, and an output layer. Each unit has an activation function, here
the sigmoid, and is parameterized by the weight matrix W:
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We can write out the activation functions for each of the units in the hidden layer: 


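$$a_1^{(2)} = g\left( W_{10}^{(1)} x_0 + W_{11}^{(1)} x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 \right)$$

$$a_2^{(2)} = g\left( W_{20}^{(1)} x_0 + W_{21}^{(1)} x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 \right)$$

$$a_3^{(2)} = g\left( W_{30}^{(1)} x_0 + W_{31}^{(1)} x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 \right)$$

The activation function for the output layer is as follows:

$$h_w(x) = a_1^{(3)} = g\left( W_{10}^{(2)} a_0^{(2)} + W_{11}^{(2)} a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} \right)$$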


More generally, we can say a function mapping from a given layer, j, to the layer j + 1 is determined by
the parameter matrix, $W^{(j)}$. The superscript j represents the jth layer, and the subscript, i, denotes the
unit in that layer. We denote the parameter or weight matrix that governs the mapping from
the layer j to the layer j + 1 as $W^{(j)}$. We denote the individual weights in the subscript of their matrix index.


Note that the dimensions of the parameter matrix for each layer will be the number of units in the next
layer multiplied by the number of units in the current layer plus 1; this is for $x_0$, which is the bias
unit. More formally, we can write the dimension of the parameter matrix for a given layer, j, as
follows:

$$d_{(j+1)} \times (d_j + 1)$$

The subscript (j + 1) refers to the number of units in the next layer, that is, the forward layer, and the
$d_j + 1$ refers to the number of units in the current layer plus 1.
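As a quick check of this rule, the following sketch lists the parameter matrix shapes for a small network. The layer sizes are a hypothetical example (three inputs, three hidden units, one output), and bias units are not counted in them:

```python
layer_sizes = [3, 3, 1]  # hypothetical units per layer, bias units excluded

# W^(j) maps layer j to layer j + 1 and has shape d_(j+1) x (d_j + 1)
shapes = [(layer_sizes[j + 1], layer_sizes[j] + 1)
          for j in range(len(layer_sizes) - 1)]

print(shapes)
```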


Let's now look at how we can calculate these activation functions using a vector implementation. We
can write these functions more compactly by defining a new term, z, which consists of the weighted
linear combination of the input values for each unit on a given layer. Here is an example:

$$a_1^{(2)} = g\left( z_1^{(2)} \right)$$

We are just replacing everything in the inner term of our activation function with a single function, z.
Here, the superscript (2) represents the layer number, and the subscript 1 indicates the unit in that
layer. So, more generally, the matrix that defines the activation function for the layer j is as follows:

$$z^{(j)} = W^{(j-1)} a^{(j-1)}, \qquad a^{(j)} = g\left( z^{(j)} \right)$$

So, in our three layer example, our output layer can be defined as follows:

$$h_w(x) = a^{(3)} = g\left( z^{(3)} \right)$$
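The vector implementation can be sketched in NumPy as a loop over the weight matrices. The function name forward, the random weights, and the input values are assumptions for illustration only; the shapes follow the three-layer example, with three inputs, three hidden units, and one output:

```python
import numpy as np


def g(z):
    # sigmoid activation, applied element-wise
    return 1.0 / (1.0 + np.exp(-z))


def forward(x, weights):
    # x: raw input vector without the bias entry
    # weights: list of matrices, each of shape d_(j+1) x (d_j + 1)
    a = np.asarray(x, dtype=float)
    for W in weights:
        a = np.concatenate(([1.0], a))  # prepend the bias unit
        z = W @ a                       # weighted linear combination
        a = g(z)                        # activations of the next layer
    return a


rng = np.random.default_rng(0)
weights = [rng.uniform(-1, 1, size=(3, 4)),  # layer 1 -> layer 2
           rng.uniform(-1, 1, size=(1, 4))]  # layer 2 -> layer 3
h = forward([0.5, -1.0, 2.0], weights)
print(h.shape)
```

Each pass through the loop computes one layer's z and a, so the final value of a is the network output.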


We can see how features are learned by first looking at just the three units on the single hidden layer and how
they map their input to the input of the single unit on the output layer. We can see that the output unit is only
performing logistic regression using the set of features $a^{(2)}$. The difference is that now the input features,
the activations of the hidden layer, have themselves been computed using the weights learned from the raw
features at the input layer. Through hidden layers, we can start to fit more complicated nonlinear functions.


We can solve our XNOR problem using the following neural net architecture: 





Here, we have three units on the input layer, two units plus the bias unit on the single hidden layer,
and one unit on the output layer. We can set the weights for the first unit in the hidden layer (not
including the bias unit) to perform the logical function $x_1$ AND $x_2$. The weights for the second unit
perform the function (NOT $x_1$) AND (NOT $x_2$). Finally, our output layer performs the OR function.
We can write our activation functions as follows:

$$a_1^{(2)} = g(-15x_0 + 10x_1 + 10x_2)$$

$$a_2^{(2)} = g(10x_0 - 20x_1 - 20x_2)$$

$$a_1^{(3)} = g\left( -5a_0^{(2)} + 10a_1^{(2)} + 10a_2^{(2)} \right)$$

The truth table for this network looks like this:

x1   x2   a1(2)   a2(2)   h_w(x)
0    0    0       1       1
0    1    0       0       0
1    0    0       0       0
1    1    1       0       1
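The XNOR network can be verified directly. This sketch assumes the weights -15, 10, 10 for the AND unit, 10, -20, -20 for the (NOT x1) AND (NOT x2) unit, and -5, 10, 10 for the OR unit at the output, with all bias units equal to one:

```python
import math


def g(z):
    # sigmoid activation
    return 1.0 / (1.0 + math.exp(-z))


def xnor_net(x1, x2):
    a1 = g(-15 + 10 * x1 + 10 * x2)   # x1 AND x2
    a2 = g(10 - 20 * x1 - 20 * x2)    # (NOT x1) AND (NOT x2)
    return g(-5 + 10 * a1 + 10 * a2)  # a1 OR a2


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xnor_net(x1, x2)))
```

No single logistic unit can produce this output, because the two positive cases cannot be separated from the two negative cases by one linear boundary; the hidden layer is what makes it possible.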


To perform multiclass classification with neural networks, we use architectures with an output unit
for each class that we are trying to classify. The network outputs a vector of binary numbers with 1
indicating that the class is present. This output variable is an i-dimensional vector, where i is the
number of output classes. The output space for four classes, for example, would look like this:

$$y^{(1)} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad y^{(2)} = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \quad y^{(3)} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \quad y^{(4)} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}$$

Our goal is to define a hypothesis function to approximately equal one of these four vectors:

$$h_w(x) \approx y^{(i)}$$

This is essentially a one versus all representation.
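A small sketch of this target encoding, and of how a network output might be turned back into a class label, is shown below. The helper names one_hot and decode and the example output vector are hypothetical; classes are numbered from zero here, as is usual in Python:

```python
def one_hot(label, n_classes):
    # vector with 1 at the position of the class and 0 elsewhere
    return [1 if k == label else 0 for k in range(n_classes)]


def decode(output):
    # pick the output unit with the largest activation
    return max(range(len(output)), key=lambda k: output[k])


targets = [one_hot(k, 4) for k in range(4)]
prediction = decode([0.1, 0.05, 0.8, 0.2])  # hypothetical network output

print(targets)
print(prediction)
```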


We can describe a neural network architecture by the number of layers, L, and by the number of units
in each layer, $s_l$, where the subscript indicates the layer number. For convenience, I am
going to define a variable, t, indicating the number of units on the layer l + 1, where l + 1 is the
forward layer, that is, the layer to the right-hand side of the diagram.
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Cost function 


To fit the weights in a neural net for a given training set, we first need to define a cost function:

$$J(W) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log \left( h_w(x^{(i)}) \right)_k + \left( 1 - y_k^{(i)} \right) \log \left( 1 - \left( h_w(x^{(i)}) \right)_k \right) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( W_{ji}^{(l)} \right)^2$$

This is very similar to the cost function we used for logistic regression, except that now we are also
summing over K output units. The triple summation used in the regularization term looks a bit
complicated, but all it is really doing is summing over each of the terms in the parameter matrix, and
using this to calculate the regularization. Note that the summations over l, i, and j start at 1, rather than 0;
this is to reflect the fact that we do not apply regularization to the bias unit.




Minimizing the cost function 


Now that we have a cost function, we need to work out a way to minimize it. As with gradient descent,
we need to compute the partial derivatives to calculate the slope of the cost function. This is done
using the back propagation algorithm. It is called back propagation because we begin by calculating
the error at the output layer, then calculating the error for each previous layer in turn. We can use
these derivatives of the cost function to work out parameter values for each of the units in
our neural network. To do this, we need to define an error term:

$$\delta_j^{(l)} = \text{error of node } j \text{ in layer } l$$


For this example, let's assume that we have a total of three layers, including the input and output 
layers. The error at the output layer can be written as follows: 


$$\delta_j^{(3)} = a_j^{(3)} - y_j = \left( h_w(x) \right)_j - y_j$$


The activation function in the final layer is equivalent to our hypothesis function, and we can use 
simple vector subtraction to calculate the difference between the values predicted by our hypothesis, 
and the actual values in our training set. Once we know the error in our output layer, we are able to 
back propagate to find the error, which is the delta values, in previous layers: 


$$\delta^{(2)} = \left( W^{(2)} \right)^T \delta^{(3)} * g'\left( z^{(2)} \right)$$

This will calculate the error for layer two. We use the transpose of the parameter matrix of the
current layer, in this example layer 2, multiplied by the error vector from the forward layer, in this
case layer 3. We then use pairwise multiplication, indicated by the * symbol, with the derivative of
the activation function, g, evaluated at the input values given by $z^{(2)}$. We can calculate this derivative
term by the following:

$$g'\left( z^{(2)} \right) = a^{(2)} * \left( 1 - a^{(2)} \right)$$

If you know calculus, it is a fairly straightforward procedure to prove this, but for our purposes, we
will not go into it here. As you would expect when we have more than one hidden layer, we can
calculate the delta values for each hidden layer in exactly the same way, using the parameter matrix,
the delta vector for the forward layer, and the derivative of the activation function for the current
layer. We do not need to calculate the delta values for layer 1 because these are just the features
themselves without any errors. Finally, through a rather complicated mathematical proof that we will
not go into here, we can write the derivative of the cost function, ignoring regularization, as follows:

$$\frac{\partial}{\partial W_{ij}^{(l)}} J(W) = a_j^{(l)} \delta_i^{(l+1)}$$
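The delta and derivative formulas can be sketched for a single training sample as follows. The network size, the random weights, and the sample are assumptions for illustration; the cost is taken to be the unregularized logistic cost for that one sample:

```python
import numpy as np


def g(z):
    # sigmoid activation, applied element-wise
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(1)
W1 = rng.uniform(-1, 1, size=(3, 3))  # layer 1 -> 2: 2 inputs + bias, 3 hidden units
W2 = rng.uniform(-1, 1, size=(1, 4))  # layer 2 -> 3: 3 hidden units + bias, 1 output
x, y = np.array([1.0, 0.0]), np.array([1.0])  # one hypothetical training sample

# forward pass
a1 = np.concatenate(([1.0], x))
z2 = W1 @ a1
a2 = np.concatenate(([1.0], g(z2)))
a3 = g(W2 @ a2)

# backward pass
delta3 = a3 - y  # error at the output layer
# drop the bias entry, then multiply element-wise by g'(z2) = a * (1 - a)
delta2 = (W2.T @ delta3)[1:] * g(z2) * (1 - g(z2))

# partial derivatives: dJ/dW_ij^(l) = a_j^(l) * delta_i^(l+1)
grad_W2 = np.outer(delta3, a2)
grad_W1 = np.outer(delta2, a1)
print(grad_W1.shape, grad_W2.shape)
```

Each gradient has the same shape as the weight matrix it belongs to, which is what allows the weights to be updated element by element.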


By computing the delta terms using back propagation, we can find these partial derivatives for each of
the parameter values. Now, let's see how we apply this to a dataset of training samples. We need to
define capital delta, $\Delta$, which is just the matrix of the delta terms, with one entry $\Delta_{ij}^{(l)}$ for each
layer l and each pair of units i and j. This
will act as an accumulator of the delta values from each node in the neural network, as the algorithm
loops through each training sample. Within each loop, it performs the following functions on each
training sample:

1. It sets the activations in the first layer to each value of x, that is, our input features.

2. It performs forward propagation on each subsequent layer in turn up to the output layer to
calculate the activations for each layer.

3. It computes the delta values at the output layer and begins the process of back propagation. This
is similar to the process we performed in forward propagation, except that it occurs in reverse.
So, for our output layer in our 3-layer example, it is demonstrated as follows:

$$\delta^{(3)} = a^{(3)} - y^{(i)}$$


Remember that this is all happening in a loop, so we are dealing with one training sample at a time;
$y^{(i)}$ represents the target value of the ith training sample. We can now use the back propagation
algorithm to calculate the delta values for previous layers. We can now add these values to the
accumulator, using the update rule:

$$\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)} \delta_i^{(l+1)}$$

This formula can be expressed in its vectorized form, updating all the entries at once, as shown:

$$\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} \left( a^{(l)} \right)^T$$


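Now, we can add our regularization term (the weights of the bias units are left out, as before):

$$\Delta^{(l)} := \Delta^{(l)} + \lambda W^{(l)}$$

Finally, we can update the weights by performing gradient descent:

$$W^{(l)} := W^{(l)} - \alpha \Delta^{(l)}$$

Remember that $\alpha$ is the learning rate, that is, a hyper parameter we set to a small number between 0
and 1.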
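The accumulation and update steps can be sketched for one layer as follows. The activations and errors are random stand-ins rather than the output of a real forward and backward pass, and the values of alpha and lam are arbitrary assumptions; the point is only to show that the per-sample loop and a single matrix product give the same accumulator:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4                                # number of training samples
A = rng.uniform(size=(m, 3))         # stand-in activations a^(l), one row per sample
D = rng.normal(size=(m, 2))          # stand-in errors delta^(l+1), one row per sample
W = rng.uniform(-1, 1, size=(2, 3))  # weights W^(l); column 0 multiplies the bias unit
alpha, lam = 0.1, 0.01               # hypothetical learning rate and regularization

# accumulate one training sample at a time
Delta_loop = np.zeros_like(W)
for i in range(m):
    Delta_loop += np.outer(D[i], A[i])

# the same accumulation written as one matrix product
Delta = D.T @ A
same = np.allclose(Delta_loop, Delta)

# regularize every column except the bias column, then take a gradient step
Delta[:, 1:] += lam * W[:, 1:]
W_new = W - alpha * Delta
print(same, W_new.shape)
```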
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Implementing a neural network 


There is one more thing we need to consider, and that is the initialization of our weights. If we
initialize them to 0, or all to the same number, all the units on the forward layer will be computing the
same function at the input, making the calculation highly redundant and unable to fit complex data. In 
essence, what we need to do is break the symmetry so that we give each unit a slightly different 
starting point that actually allows the network to create more interesting functions. 
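The symmetry problem is easy to demonstrate. In this sketch, the input vector and the constant 0.5 are arbitrary assumptions; with identical weights, every unit on the forward layer produces exactly the same activation, whereas random weights give each unit a different starting point:

```python
import numpy as np


def g(z):
    # sigmoid activation, applied element-wise
    return 1.0 / (1.0 + np.exp(-z))


x = np.array([1.0, 0.3, -0.7])  # bias unit followed by two hypothetical inputs

W_same = np.full((3, 3), 0.5)   # every weight set to the same value
W_rand = np.random.default_rng(0).uniform(-1.0, 1.0, size=(3, 3))

a_same = g(W_same @ x)          # three identical activations
a_rand = g(W_rand @ x)          # activations differ from unit to unit
print(a_same)
print(a_rand)
```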


Now, let's look at how we might implement this in code. This implementation is written by Sebastian 
Raschka, taken from his excellent book, Python Machine Learning, released by Packt Publishing: 


import numpy as np
from scipy.special import expit
import sys


class NeuralNetMLP(object):

    def __init__(self, n_output, n_features, n_hidden=30,
                 l1=0.0, l2=0.0, epochs=500, eta=0.001,
                 alpha=0.0, decrease_const=0.0, shuffle=True,
                 minibatches=1, random_state=None):

        np.random.seed(random_state)
        self.n_output = n_output
        self.n_features = n_features
        self.n_hidden = n_hidden
        self.w1, self.w2 = self._initialize_weights()
        self.l1 = l1
        self.l2 = l2
        self.epochs = epochs
        self.eta = eta
        self.alpha = alpha
        self.decrease_const = decrease_const
        self.shuffle = shuffle
        self.minibatches = minibatches

    def _encode_labels(self, y, k):
        # one-hot encode the class labels: one row per class, one column per sample
        onehot = np.zeros((k, y.shape[0]))
        for idx, val in enumerate(y):
            onehot[val, idx] = 1.0
        return onehot

    def _initialize_weights(self):
        """Initialize weights with small random numbers."""
        w1 = np.random.uniform(-1.0, 1.0, size=self.n_hidden*(self.n_features + 1))
        w1 = w1.reshape(self.n_hidden, self.n_features + 1)
        w2 = np.random.uniform(-1.0, 1.0, size=self.n_output*(self.n_hidden + 1))
        w2 = w2.reshape(self.n_output, self.n_hidden + 1)
        return w1, w2

    def _sigmoid(self, z):
        # return 1.0 / (1.0 + np.exp(-z))
        return expit(z)

    def _sigmoid_gradient(self, z):
        sg = self._sigmoid(z)
        return sg * (1 - sg)

    def _add_bias_unit(self, X, how='column'):
        if how == 'column':
            X_new = np.ones((X.shape[0], X.shape[1]+1))
            X_new[:, 1:] = X
        elif how == 'row':
            X_new = np.ones((X.shape[0]+1, X.shape[1]))
            X_new[1:, :] = X
        else:
            raise AttributeError('`how` must be `column` or `row`')
        return X_new

    def _feedforward(self, X, w1, w2):
        a1 = self._add_bias_unit(X, how='column')
        z2 = w1.dot(a1.T)
        a2 = self._sigmoid(z2)
        a2 = self._add_bias_unit(a2, how='row')
        z3 = w2.dot(a2)
        a3 = self._sigmoid(z3)
        return a1, z2, a2, z3, a3

    def _L2_reg(self, lambda_, w1, w2):
        """Compute L2-regularization cost"""
        return (lambda_/2.0) * (np.sum(w1[:, 1:] ** 2) + np.sum(w2[:, 1:] ** 2))

    def _L1_reg(self, lambda_, w1, w2):
        """Compute L1-regularization cost"""
        return (lambda_/2.0) * (np.abs(w1[:, 1:]).sum() + np.abs(w2[:, 1:]).sum())

    def _get_cost(self, y_enc, output, w1, w2):
        term1 = -y_enc * (np.log(output))
        term2 = (1 - y_enc) * np.log(1 - output)
        cost = np.sum(term1 - term2)
        L1_term = self._L1_reg(self.l1, w1, w2)
        L2_term = self._L2_reg(self.l2, w1, w2)
        cost = cost + L1_term + L2_term
        return cost

    def _get_gradient(self, a1, a2, a3, z2, y_enc, w1, w2):
        # backpropagation
        sigma3 = a3 - y_enc
        z2 = self._add_bias_unit(z2, how='row')
        sigma2 = w2.T.dot(sigma3) * self._sigmoid_gradient(z2)
        sigma2 = sigma2[1:, :]
        grad1 = sigma2.dot(a1)
        grad2 = sigma3.dot(a2.T)

        # regularize (the bias column is left out)
        grad1[:, 1:] += (w1[:, 1:] * (self.l1 + self.l2))
        grad2[:, 1:] += (w2[:, 1:] * (self.l1 + self.l2))

        return grad1, grad2

    def predict(self, X):
        if len(X.shape) != 2:
            raise AttributeError('X must be a [n_samples, n_features] array.\n'
                                 'Use X[:,None] for 1-feature classification,'
                                 '\nor X[[i]] for 1-sample classification')

        a1, z2, a2, z3, a3 = self._feedforward(X, self.w1, self.w2)
        y_pred = np.argmax(z3, axis=0)
        return y_pred

    def fit(self, X, y, print_progress=False):
        self.cost_ = []
        X_data, y_data = X.copy(), y.copy()
        y_enc = self._encode_labels(y, self.n_output)

        delta_w1_prev = np.zeros(self.w1.shape)
        delta_w2_prev = np.zeros(self.w2.shape)

        for i in range(self.epochs):

            # adaptive learning rate
            self.eta /= (1 + self.decrease_const*i)

            if print_progress:
                sys.stderr.write('\rEpoch: %d/%d' % (i+1, self.epochs))
                sys.stderr.flush()

            if self.shuffle:
                idx = np.random.permutation(y_data.shape[0])
                X_data, y_data = X_data[idx], y_data[idx]

            mini = np.array_split(range(y_data.shape[0]), self.minibatches)
            for idx in mini:

                # feedforward
                a1, z2, a2, z3, a3 = self._feedforward(X[idx], self.w1, self.w2)
                cost = self._get_cost(y_enc=y_enc[:, idx],
                                      output=a3,
                                      w1=self.w1,
                                      w2=self.w2)
                self.cost_.append(cost)

                # compute gradient via backpropagation
                grad1, grad2 = self._get_gradient(a1=a1, a2=a2,
                                                  a3=a3, z2=z2,
                                                  y_enc=y_enc[:, idx],
                                                  w1=self.w1,
                                                  w2=self.w2)

                # gradient step with a momentum term scaled by alpha
                delta_w1, delta_w2 = self.eta * grad1, self.eta * grad2
                self.w1 -= (delta_w1 + (self.alpha * delta_w1_prev))
                self.w2 -= (delta_w2 + (self.alpha * delta_w2_prev))
                delta_w1_prev, delta_w2_prev = delta_w1, delta_w2

        return self


Now, let's apply this neural net to the iris sample dataset. Remember that this dataset contains three
classes, so we set the n_output parameter (the number of output units) to 3. X.shape[1], the size of the
second axis of the dataset, gives the number of features. We create 50 hidden units and train for 100
epochs, with each epoch being a complete loop over all the training set. Here, we set alpha to .001; note
that in this implementation alpha scales the momentum term, while the learning rate, eta, keeps its default
value of 0.001. We display a plot of the cost against the number of epochs:


from sklearn import datasets
import matplotlib.pyplot as plt

iris = datasets.load_iris()
X = iris.data
y = iris.target
nn = NeuralNetMLP(3, X.shape[1], n_hidden=50, epochs=100, alpha=.001)
nn.fit(X, y)
plt.plot(range(len(nn.cost_)), nn.cost_)
plt.show()


Here is the output: 
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The graph shows how the cost is decreasing on each epoch. To get a feel for how the model works, 
spend some time experimenting with it on other data sets and with a variety of input parameters. One 
particular data set that is used often when testing multiclass classification problems is the MNIST
dataset, which is available at http://yann.lecun.com/exdb/mnist/. Its training set consists of 60,000
images of handwritten digits, along with their labels. It is often used as a benchmark for machine
learning algorithms.
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Gradient checking 


Back propagation, and neural nets in general, are a little difficult to conceptualize. So, it is often not
easy to understand how changing any of the model (hyper) parameters will affect the outcome.
Furthermore, with different implementations, it is possible to get results that indicate that an algorithm
is working correctly, that is, the cost function is decreasing on each step of gradient descent.
However, as with any complicated software, there can be hidden bugs that might only manifest
themselves under very specific conditions. A way to help eliminate these is through a procedure
called gradient checking. This is a numerical way of approximating gradients, and we can
understand this intuitively by examining the following diagram:


[Figure: diagram illustrating the difference J(w + ε) - J(w - ε)]

The derivative of J(w), with respect to w, can be approximated as follows:

$$\frac{d}{dw} J(w) \approx \frac{J(w + \epsilon) - J(w - \epsilon)}{2\epsilon}$$


The preceding formula approximates the derivative when the parameter is a single value. We need to
evaluate these derivatives on a cost function, where the weights are a vector. We do this by
performing a partial derivative on each of the weights in turn. Here is an example:




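$$ w = [w_1, w_2, \ldots, w_n] $$

$$ \frac{\partial}{\partial w_1} J(w) \approx \frac{J(w_1 + \epsilon, w_2, \ldots, w_n) - J(w_1 - \epsilon, w_2, \ldots, w_n)}{2\epsilon} $$

$$ \frac{\partial}{\partial w_n} J(w) \approx \frac{J(w_1, w_2, \ldots, w_n + \epsilon) - J(w_1, w_2, \ldots, w_n - \epsilon)}{2\epsilon} $$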
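
As a rough illustration of gradient checking, the following sketch compares a central-difference approximation with a hand-derived gradient. The cost function (a sum of squared weights), the function names, and the step size eps are assumptions chosen for the example; they are not taken from the network code used earlier.

```python
def numerical_gradient(cost, w, eps=1e-5):
    """Approximate each partial derivative of cost at w by a central difference."""
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[i] += eps    # perturb only the i-th weight upwards
        w_minus[i] -= eps   # and downwards
        grad.append((cost(w_plus) - cost(w_minus)) / (2 * eps))
    return grad


# Hypothetical cost J(w) = sum of squared weights, whose exact gradient is 2w.
def cost(w):
    return sum(wi ** 2 for wi in w)


def analytic_gradient(w):
    return [2 * wi for wi in w]


w = [0.5, -1.0, 2.0]
numeric = numerical_gradient(cost, w)
analytic = analytic_gradient(w)

# A small maximum difference suggests the analytic gradient is implemented correctly.
max_diff = max(abs(n - a) for n, a in zip(numeric, analytic))
print(numeric, analytic, max_diff)
```

In practice, the same comparison would be made between the numerical gradient and the gradient produced by back propagation; a large discrepancy points to a bug in one of the two.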
Other neural net architectures


Much of the most important work being done in the field of neural net models, and indeed machine
learning in general, uses very complex neural nets with many layers and features. This approach is
often called deep architecture or deep learning. Human and animal learning occurs at a rate and
depth that no machine can match. Many of the elements of biological learning still remain a mystery.
One of the key areas of research, and one of the most useful in practical applications, is that of object
recognition. This is something quite fundamental to living systems, and higher animals have evolved
an extraordinary ability to learn complex relationships between objects. Biological brains have many
layers; each synaptic event exists in a long chain of synaptic processes. In order to recognize complex
objects, such as people's faces or handwritten digits, a fundamental task is to create a
hierarchy of representation from the raw input to higher and higher levels of abstraction. The goal is
to transform raw data, such as a set of pixel values, into something we can describe as, say, a person
riding a bicycle. One approach to solving these sorts of problems is to use a sparse representation that
creates higher dimensional feature spaces, where there are many features, but only very few of them
have non-zero values. This approach is attractive for several reasons. Firstly, features may become
more linearly separable in higher feature spaces. Also, it has been shown in certain models that
sparsity can be used to make training more efficient and help extract information from very noisy data.
We will explore this idea and the general concept of feature extraction in greater detail in the next
chapter.


Another interesting idea is that of recurrent neural networks or RNNs. These are in many ways
quite distinct from the feed forward networks that we have considered so far. Rather than simply 
static mappings between input and output, RNNs have at least one cyclic feedback path. RNNs 
introduce a time component to the network because a unit's input may include inputs that it received 
earlier via a feedback loop. All biological neural networks are highly recurrent. Artificial RNNs 
have shown promise in areas such as speech and handwriting recognition. However, they are, in
general, much harder to train because we cannot simply back propagate the error. We have to take into 
consideration the time component and the dynamic, nonlinear characteristics of such systems. RNNs 
will provide a very interesting area for future research. 
Summary


In this chapter, we introduced the powerful machine learning algorithms of artificial neural networks.
We saw how these networks are a simplified model of neurons in the brain. They can perform
complex learning tasks, such as learning highly nonlinear decision boundaries, using layers of
artificial neurons, or units, to learn new features from labelled data. In the next chapter, we will look
at the crucial component of any machine learning algorithm, that is, its features.
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Chapter 7. Features — How Algorithms See the 
World 


So far in this book, we suggested a number of ways and a number of reasons for creating, extracting, 
or, otherwise, manipulating features. In this chapter, we will address this topic head on. The right 
features, sometimes called attributes, are the central component for machine learning models. A 
sophisticated model with the wrong features is worthless. Features are how our applications see the 
world. For all but the most simple tasks, we will process our features before feeding them to a model. 
There are many interesting ways in which we can do this, and it is such an important topic that it's 
appropriate to devote an entire chapter to it. 


It has only been in the last decade or so that machine learning models have been routinely using tens
of thousands of features or more. This allows us to tackle many different problems, such as those
where our feature set is large compared to the number of samples. Two typical applications are
genetic analysis and text categorization. For genetic analysis, our variables are a set of gene
expression coefficients. These are based on the amount of mRNA present in a sample, for example,
taken from a tissue biopsy. A classification task might be performed to predict whether a patient has
cancer or not. The number of training and test samples together may be less than 100. On the
other hand, the number of variables in the raw data may range from 6,000 to 60,000. Not only will
this translate to a large number of features, it also means that the range of values between features is
quite large too. In this chapter, we will cover the following topics:


Feature types 

Operations and statistics 
Structured features 
Transforming features 
Principal component analysis
Feature types


There are three distinct types of features: quantitative, ordinal, and categorical. We can also consider 
a fourth type of feature—the Boolean—as this type does have a few distinct qualities, although it is 
actually a type of categorical feature. These feature types can be ordered in terms of how much
information they convey. Quantitative features have the highest information capacity followed by 
ordinal, categorical, and Boolean. 


Let's take a look at the tabular analysis: 


Feature type    Dispersion                                 Shape
Quantitative    Range, variance, and standard deviation    Skewness, kurtosis


The preceding table shows the three types of features, their statistics, and properties. Each feature
type inherits the statistics of the feature types in the rows below it in the table. For example, the
measures of central tendency for quantitative features include the median and mode.
Quantitative features


The distinguishing characteristic of quantitative features is that they are continuous, and they usually 
involve mapping them to real numbers. Often, feature values can be mapped to a subset of real 
numbers, for example, expressing age in years; however, care must be taken to use the full scale when 
calculating statistics, such as mean or standard deviation. Because quantitative features have a 
meaningful numeric scale, they are often used in geometric models. When they are used in tree
models, they result in a binary split, for example, using a threshold value where values above the
threshold go to one child and values equal to or below the threshold go to the other child. Tree
models are insensitive to monotonic transformations of scale, that is, transformations that do not
change the ordering of the feature values. For example, it does not matter to a tree model if we
measure length in centimeters or inches, or use a logarithmic or linear scale; we simply have to
change the threshold values to the same scale. Tree models ignore the scale of quantitative features
and treat them as ordinal. This is also true for rule-based models. For probabilistic models, such as 
the naive Bayes classifier, quantitative features need to be discretized into a finite number of bins, 
and therefore, converted to categorical features. 
Ordinal features


Ordinal features are features that have a distinct order but do not have a scale. They can be encoded
as integer values; however, doing so does not imply any scale. A typical example is that of house
numbers. Here, we can discern the position of a house on a street by its number. We assume that house
number 1 will come before house number 20 and that houses with the numbers 10 and 11 would be
located close to each other. However, the size of the number does not imply any scale; for example,
there is no reason to believe that house number 20 will be larger than house number 1. The domain of
an ordinal feature is a totally ordered set such as a set of characters or strings. Because ordinal
features lack a linear scale, it does not make sense to add or subtract them; therefore, operations such
as averaging ordinal features do not usually make sense or yield any information about the features.
Similar to quantitative features in tree models, ordinal features result in a binary split. In general,
ordinal features are not readily used in most geometric models. For example, linear models assume a
Euclidean instance space where feature values are treated as Cartesian coordinates. For distance-based
models, we can use ordinal features if we encode them as integers and take the distance between
them to be simply the absolute value of their difference. Note that this is not the Hamming distance,
which only records whether two values are equal or not.
Categorical features


Categorical features, sometimes called nominal features, do not have any ordering or scale, and
therefore, they do not allow any statistical summary apart from the mode indicating the most frequent
occurrence of a value. Categorical features are often best handled by probabilistic models; however,
they can also be used in distance-based models using the Hamming distance, that is, by setting the
distance to 0 for equal values and 1 for unequal values. A subtype of categorical features is the Boolean
feature, which maps into the Boolean values of true or false.
Operations and statistics


Features can be defined by the allowable operations that can be performed on them. Consider two
features: a person's age and their phone number. Although both these features can be described by
integers, they actually represent two very different types of information. This is clear when we see
which operations we can usefully perform on them. For example, calculating the average age of a
group of people will give us a meaningful result; calculating the average phone number will not.


We call the range of possible calculations that can be performed on a feature its statistics.
These statistics describe three separate aspects of the data: its central tendency, its
dispersion, and its shape.


To calculate the central tendency of data, we usually use one or more of the following statistics: the
mean (or average), the median (or the middle value in an ordered list), and the mode (or the most
frequently occurring value). The mode is the only statistic that can be applied to all data types. To
calculate the median, we need feature values that can be somehow ordered, that is, ordinal or
quantitative. To calculate the mean, values must be expressed on some scale, such as the linear scale.
In other words, they will need to be quantitative features.


The most common way of calculating dispersion is through the statistics of variance or standard
deviation. These are both really the same measure but on different scales, with standard deviation
being useful because it is expressed on the same scale as the feature itself. Also, remember that the
absolute difference between the mean and the median is never larger than the standard deviation. A
simpler statistic for measuring dispersion is the range, which is just the difference between the
minimum and maximum values. From here, of course, we can estimate the feature's central tendency
by calculating the mid-range point. Another way to measure dispersion is using units such as
percentiles or deciles to measure the proportion of instances falling below a particular value. For
example, the p-th percentile is the value that p percent of instances fall below.


Measuring shape statistics is a little more complicated and can be understood using the idea of the
central moment of a sample. This is defined as follows:


$$ m_k = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^k $$


Here, n is the number of samples, μ is the sample mean, and k is an integer. When k = 1, the first
central moment is 0 because this is simply the average deviation from the mean, which is always 0.
The second central moment is the average squared deviation from the mean, which is the variance.


We can define skewness as follows:


$$ \frac{m_3}{\sigma^3} $$


Here, σ is the standard deviation. If this formula gives a positive value, the deviations above the
mean outweigh those below it: the distribution has a longer tail to the right and the data, when
graphed, is skewed to the right. When the skew is negative, the converse is true.


We can define kurtosis as a similar relationship for the fourth central moment: 


$$ \frac{m_4}{\sigma^4} $$


It can be shown that a normal distribution has a kurtosis of 3. At values above this, the distribution 
will be more peaked. At kurtosis values below 3, the distribution will be flatter. 
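
The shape statistics above can be computed directly from the definition of the central moment. The following is a minimal sketch in plain Python; the function names and the two small samples are made up for illustration.

```python
def central_moment(xs, k):
    """k-th central moment: the average of (x - mean)**k over the sample."""
    n = len(xs)
    mu = sum(xs) / n
    return sum((x - mu) ** k for x in xs) / n


def skewness(xs):
    """Third central moment divided by the cube of the standard deviation."""
    sigma = central_moment(xs, 2) ** 0.5
    return central_moment(xs, 3) / sigma ** 3


def kurtosis(xs):
    """Fourth central moment divided by the fourth power of the standard deviation."""
    sigma = central_moment(xs, 2) ** 0.5
    return central_moment(xs, 4) / sigma ** 4


data = [1, 2, 2, 3, 3, 3, 4, 10]   # made-up sample with a long right tail
symmetric = [1, 2, 3, 4, 5]        # made-up symmetric sample

print(central_moment(data, 1))     # first central moment
print(central_moment(data, 2))     # variance
print(skewness(data))
print(skewness(symmetric))
print(kurtosis(symmetric))
```

The long right tail of the first sample should give a positive skewness, while the symmetric sample should give a skewness of zero (up to rounding).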


We previously discussed the three types of data, that is, categorical, ordinal, and quantitative. 


Machine learning models will treat the different data types in very distinct ways. For example, a 
decision tree split on a categorical feature will give rise to as many children as there are values. For 
ordinal and quantitative features, the splits will be binary, with each parent giving rise to two 
children based on a threshold value. As a consequence, tree models treat quantitative features as 
ordinal, ignoring the feature's scale. When we consider probabilistic models such as the Bayes
classifier, we can see that it actually treats ordinal features as categorical, and the only way in which
it can handle quantitative features is to turn them into a finite number of discrete values, therefore
converting them to categorical data. 


Geometric models, in general, require features that are quantitative. For example, linear models
operate in a Euclidean instance space, with the features acting as Cartesian coordinates. Each feature
value is considered as a scalar relationship to other feature values. Distance-based models, such as
the k-nearest neighbor, can incorporate categorical features by setting the distance to 0 for equal
values and 1 for unequal values. Similarly, we can incorporate ordinal features into distance-based
models by counting the number of values between two values. If we are encoding feature values as
integers, then the distance is simply the numerical difference. By choosing an appropriate distance
metric, it is possible to incorporate ordinal and categorical features into distance-based models.
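
The distance rules described above can be sketched as a single function. This is only an illustration, not a library API: the name mixed_distance, the feature-type labels, and the sample instances are all assumptions made for the example.

```python
def mixed_distance(a, b, kinds):
    """Sum per-feature distances, choosing the rule by feature type."""
    total = 0
    for x, y, kind in zip(a, b, kinds):
        if kind == "categorical":
            total += 0 if x == y else 1   # 0 for equal values, 1 for unequal values
        else:
            # ordinal (integer-encoded) and quantitative: numerical difference
            total += abs(x - y)
    return total


kinds = ["categorical", "ordinal", "quantitative"]
# hypothetical instances: (city, size rank encoded as 0..2, age in years)
p = ("Sydney", 0, 30.0)
q = ("Perth", 2, 34.5)
r = ("Sydney", 1, 30.0)

print(mixed_distance(p, q, kinds))
print(mixed_distance(p, r, kinds))
```

Note that a simple sum like this lets features with large numeric ranges dominate, so in practice the quantitative features would usually be rescaled first.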
Structured features


We assume that each instance can be represented as a vector of feature values and that all relevant 
aspects are represented by this vector. This is sometimes called an abstraction because we filter out 
unnecessary information and represent a real-world phenomenon with a vector. For example,
representing the entire works of Leo Tolstoy as a vector of word frequencies is an abstraction. We
make no pretense that this abstraction will serve any more than a very particular limited application.
We may learn something about Tolstoy's use of language and perhaps elicit some information
regarding the sentiment and subject of Tolstoy's writing. However, we are unlikely to gain any
significant insights into the broad canvas of the 19th century Russia portrayed in these works. A
human reader, or a more sophisticated algorithm, will gain these insights not from the counting of each 
word but by the structure that these words are part of. 


We can think of structured features in a similar way to how we may think about queries in a database
programming language, such as SQL. A SQL query can represent an aggregation over variables to do
things such as finding a particular phrase or finding all the passages involving a particular character.
What we are doing in a machine learning context is creating another feature with these aggregate
properties.


Structured features can be created prior to building the model or as part of the model itself. In the first
case, the process can be understood as being a translation from first-order logic to propositional
logic. A problem with this approach is that it can create an explosion in the number of potential
features as a result of combinations with existing features. Another important point is that, in the same
way that in SQL one clause can cover a subset of another clause, structural features can also be
logically related. This is exploited in the branch of machine learning that is particularly well suited to
natural language processing, known as inductive logic programming.
Transforming features


When we transform features, our aim is to make them more useful to our models. This can
be done by adding, removing, or changing information represented by the feature. A common feature
transformation is that of changing the feature type. A typical example is binarization, that is,
transforming a categorical feature into a set of binary ones. Another example is changing an ordinal
feature into a categorical feature. In both these cases, we lose information. In the first instance, the
values of a single categorical feature are mutually exclusive, and this is not conveyed by the binary
representation. In the second instance, we lose the ordering information. These types of
transformations can be considered deductive because they consist of a well-defined logical procedure
that does not involve any choice apart from the decision to carry out these transformations in
the first place.


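Binarization can be easily carried out using the sklearn.preprocessing.Binarizer class. Let's
take a look at the following commands:


from sklearn.preprocessing import Binarizer
from random import randint

# values greater than the threshold become 1, all others become 0
binarizer = Binarizer(threshold=5)
X = [randint(0, 10) for b in range(1, 10)]
print(X)
# transform expects a 2D array, so the list is wrapped as a single sample
print(binarizer.transform([X]))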


The output shows the list of random integers followed by its binarized version, in which values
greater than 5 are replaced by 1 and all other values by 0.


Features that are categorical often need to be encoded into integers. Consider a very simple dataset
with just one categorical feature, City, with three possible values, Sydney, Perth, and Melbourne, and
we decide to encode the three values as 0, 1, and 2, respectively. If this information is to be used in a
linear classifier, then we write the constraint as a linear inequality with a weight parameter. The
problem, however, is that a single weight cannot encode a three-way choice. Suppose we have two
classes, east coast and west coast, and we need our model to come up with a decision function that
will reflect the fact that Perth is on the west coast and both Sydney and Melbourne are on the east
coast. With a simple linear model, when the feature is encoded in this way, the decision
function cannot come up with a rule that will put Sydney and Melbourne in the same class and Perth
in the other, because Perth's code lies between theirs. The
solution is to blow up the feature space to three features, each getting its own weight. This is
called one-hot encoding. Scikit-learn implements the OneHotEncoder class to perform this task.
This is an estimator that transforms each categorical feature with m possible values into m binary
features. Consider that we are using a model with data that consists of the city feature as described in
the preceding example and two other features: gender, which can be either male or female, and
occupation, which can have three values: doctor, lawyer, or banker. So, for example, a female
banker from Sydney would be represented as [1,2,0]. Three more samples are added for the
following example:


from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()

enc.fit([[1, 2, 0], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
# transform expects a 2D array with one row per sample
print(enc.transform([[1, 2, 0]]).toarray())


We will get the following output: 


[[0. 1. 0. 0. 1. 1. 0. 0.]]


Since we have two genders, three cities, and three jobs in this dataset, the first two numbers in the
transform array represent the gender, the next three represent the city, and the final three represent the 
occupation. 
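
To make the mechanics of one-hot encoding explicit, here is a small sketch in plain Python that mimics the behaviour on the same integer-coded samples. The helper names fit_categories and one_hot are hypothetical and are not part of scikit-learn.

```python
def fit_categories(rows):
    """Collect the sorted distinct values seen in each column."""
    return [sorted(set(col)) for col in zip(*rows)]


def one_hot(row, categories):
    """Expand each value into one binary indicator per category of its column."""
    encoded = []
    for value, cats in zip(row, categories):
        encoded.extend(1 if value == c else 0 for c in cats)
    return encoded


train = [[1, 2, 0], [1, 1, 0], [0, 2, 1], [1, 0, 2]]
categories = fit_categories(train)

print(categories)
print(one_hot([1, 2, 0], categories))
```

Each column contributes as many binary features as it has distinct values, which is why the three original features expand to eight.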
Discretization


I have already briefly mentioned the idea of thresholding in relation to decision trees, where we 
transform an ordinal or quantitative feature into a binary feature by finding an appropriate feature 
value to split on. There are a number of methods, both supervised and unsupervised, that can be used 
to find an appropriate split in continuous data, for example, using the statistics of central tendency
(unsupervised), such as the mean or median, or optimizing an objective function based on criteria such
as information gain (supervised).


We can go further and create multiple thresholds, transforming a quantitative feature into an ordinal 
one. Here, we divide a continuous quantitative feature into numerous discrete ordinal values. Each of 
these values is referred to as a bin, and each bin represents an interval on the original quantitative 
feature. Many machine learning models require discrete values. It becomes easier and more 
comprehensible to create rule-based models using discrete values. Discretization also makes features 
more compact and may make our algorithms more efficient. 


One of the most common approaches is to choose bins such that each bin has approximately the same 
number of instances. This is called equal frequency discretization, and if we apply it to just two 
bins, then this is the same as using the median as a threshold. This approach can be quite useful 
because the bin boundaries can be set up in such a way that they represent quantiles. For example, if
we have 100 bins, then each bin represents a percentile. 


Alternatively, we can choose the boundaries so that each bin has the same interval width. This is 
called equal width discretization. A way of working out the value of the bin width is
simply to divide the feature range by the number of bins. Sometimes, a feature does not have an upper
or lower limit, and we cannot calculate its range. In this case, integer numbers of standard deviations
above and below the mean can be used. Both width and frequency discretization are unsupervised.
They do not require any knowledge of the class labels to work. 
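
The two unsupervised schemes can be sketched in a few lines of plain Python. The function names and the sample of ages below are assumptions made for illustration only.

```python
def equal_width_edges(values, n_bins):
    """Bin boundaries that split the feature range into equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def equal_frequency_bins(values, n_bins):
    """Split the sorted values into bins holding roughly the same number of instances."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]


ages = [18, 19, 21, 22, 25, 30, 41, 58, 90]   # made-up, right-skewed sample

print(equal_width_edges(ages, 3))
print(equal_frequency_bins(ages, 3))
```

On skewed data such as this, equal-width bins leave most instances in the first interval, whereas equal-frequency bins adapt their boundaries to the data.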


Let's now turn our attention to supervised discretization. There are essentially two approaches: the 
top-down or divisive, and the agglomerative or bottom-up approach. As the names suggest, the divisive
approach works by initially assuming that all samples belong to the same bin and then progressively
splitting the bins. Agglomerative methods begin with a bin for each instance and progressively merge
these bins. Both methods require a stopping criterion to decide whether further splits or merges are necessary.


The process of recursively partitioning feature values through thresholding is an example of divisive 
discretization. To make this work, we need a scoring function that finds the best threshold for 
particular feature values. A common way to do this is to calculate the information gain of the split or
its entropy. By determining how many positive and negative samples are covered by a particular split, 
we can progressively split features based on this criterion. 
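
A single step of divisive discretization can be sketched as follows: every midpoint between consecutive distinct feature values is scored by its information gain, and the best one is kept. The function names and the toy data are assumptions for the example; applying best_threshold recursively to each side, with a stopping criterion, would give the full procedure.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    result = 0.0
    for label in set(labels):
        p = labels.count(label) / n
        result -= p * log2(p)
    return result


def best_threshold(values, labels):
    """Return (threshold, information gain) for the best binary split."""
    base = entropy(labels)
    distinct = sorted(set(values))
    best = (None, 0.0)
    # candidate thresholds are midpoints between consecutive distinct values
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - weighted
        if gain > best[1]:
            best = (t, gain)
    return best


values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]            # made-up feature values
labels = ["neg", "neg", "neg", "pos", "pos", "pos"]

threshold, gain = best_threshold(values, labels)
print(threshold, gain)
```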


Simple discretization operations can be carried out by the Pandas cut and qcut functions. Consider the
following example:


import pandas as pd
import numpy as np
print(pd.cut(np.array([1, 2, 3, 4]), 3, retbins=True, right=False))


Here is the output observed: 


([[1, 2), [2, 3), [3, 4.003), [3, 4.003)]
Categories (3, object): [[1, 2) < [2, 3) < [3, 4.003)], array([ 1.   ,  2.   ,  3.   ,  4.003]))
Normalization


Thresholding and discretization both remove the scale of a quantitative feature and, depending on the
application, this may not be what we want. Alternatively, we may want to add a measure of scale to
ordinal or categorical features. In an unsupervised setting, we refer to this as normalization. This is
often used to deal with quantitative features that have been measured on different scales. Feature
values that approximate a normal distribution can be converted to z scores. A z score is simply a signed
number of standard deviations above or below the mean. A positive z score indicates the number of
standard deviations above the mean, and a negative z score indicates the number of standard 
deviations below the mean. For some features, it may be more convenient to use the variance rather 
than the standard deviation. 


A stricter form of normalization expresses a feature on a 0 to 1 scale. If we know a feature's range, we
can simply use a linear scaling, that is, divide the difference between the original feature value and
the lowest value by the difference between the highest and lowest values. This is expressed in the
following:


$$ f_n = \frac{f - l}{h - l} $$


Here, f_n is the normalized feature, f is the original feature, and l and h are the lowest and highest
values, respectively. In many cases, we may have to guess the range. If we know something about a
particular distribution, for example, in a normal distribution more than 99% of values are likely to
fall within +3 or -3 standard deviations of the mean, then we can write a linear scaling such as the
following:


$$ f_n = \frac{f - \mu}{6\sigma} + \frac{1}{2} $$


Here, μ is the mean and σ is the standard deviation.
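
The three scalings discussed in this section can be written as short functions. This is a sketch in plain Python; the function names and the sample heights are made up, and the z score here uses the population standard deviation.

```python
def z_scores(values):
    """Signed number of standard deviations each value lies from the mean."""
    n = len(values)
    mu = sum(values) / n
    sigma = (sum((v - mu) ** 2 for v in values) / n) ** 0.5
    return [(v - mu) / sigma for v in values]


def min_max(values, low, high):
    """Linear scaling to [0, 1] when the feature range [low, high] is known."""
    return [(v - low) / (high - low) for v in values]


def three_sigma(values, mu, sigma):
    """Linear scaling that maps mu - 3*sigma to 0 and mu + 3*sigma to 1."""
    return [(v - mu) / (6 * sigma) + 0.5 for v in values]


heights = [150.0, 160.0, 170.0, 180.0, 190.0]   # made-up values in centimetres

print(z_scores(heights))
print(min_max(heights, 150.0, 190.0))
print(three_sigma([140.0, 170.0, 200.0], mu=170.0, sigma=10.0))
```

Values outside the assumed range, for example more than three standard deviations from the mean, fall outside [0, 1] under the last scaling and may need to be clipped.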
Calibration


Sometimes, we need to add scale information to an ordinal or categorical feature. This is called
feature calibration. It is a supervised feature transformation that has a number of important
applications. For example, it allows models that require scaled features, such as linear classifiers, to
handle categorical and ordinal data. It also gives models the flexibility to treat features as ordinal,
categorical, or quantitative. For binary classification, we can use the posterior probability of the
positive class, given a feature's value, to calculate the scale. For many probabilistic models, such as
naive Bayes, this method of calibration has the added advantage that the model does not require
any additional training once the features are calibrated. For categorical features, we can determine
these probabilities by simply collecting the relative frequencies from a training set.


There are cases where we might need to turn quantitative or ordinal features into categorical features
yet maintain an ordering. One of the main ways we do this is through a process of logistic calibration.
If we assume that the feature is normally distributed within each class with the same variance, then it
turns out that we can express a likelihood ratio, the ratio of the likelihoods of the positive and
negative classes, given a feature value v, as follows:





$$ LR(v) = \frac{P(v \mid pos)}{P(v \mid neg)} = \exp(d'z) $$


Where d prime is the difference between the means of the two classes divided by the standard
deviation:


$$ d' = \frac{\mu_{pos} - \mu_{neg}}{\sigma} $$


Also, z is the z score:


$$ z = \frac{v - \mu}{\sigma} $$


assuming equal class distribution:


$$ \mu = \frac{\mu_{pos} + \mu_{neg}}{2} $$


To neutralize the effect of nonuniform class distributions, we can calculate calibrated features using
the following:


$$ \frac{LR(v)}{1 + LR(v)} = \frac{\exp(d'z)}{1 + \exp(d'z)} $$


This, you may notice, is exactly the sigmoid activation function we used for logistic regression. To
summarize logistic calibration, we essentially use three steps:


1. Estimate the class means for the positive and negative classes. 
2. Transform the features into z scores. 
3. Apply the sigmoid function to give calibrated probabilities. 


Sometimes, we may skip the last step, specifically if we are using distance-based models where we 
expect the scale to be additive in order to calculate Euclidean distance. You may notice that our final
calibrated features are multiplicative in scale. 
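
The three steps of logistic calibration can be sketched in plain Python as follows. The function name, the toy data, and in particular the pooled estimate of the shared standard deviation are assumptions made for this example; the text does not prescribe how the common variance should be estimated.

```python
from math import exp


def logistic_calibrate(values, labels):
    """Map feature values to calibrated probabilities of the positive class."""
    pos = [v for v, l in zip(values, labels) if l == 1]
    neg = [v for v, l in zip(values, labels) if l == 0]
    # Step 1: estimate the class means
    mu_pos = sum(pos) / len(pos)
    mu_neg = sum(neg) / len(neg)
    # Assumed estimator: pooled variance around each instance's own class mean
    pooled_var = (sum((v - mu_pos) ** 2 for v in pos) +
                  sum((v - mu_neg) ** 2 for v in neg)) / len(values)
    sigma = pooled_var ** 0.5
    d_prime = (mu_pos - mu_neg) / sigma
    mu = (mu_pos + mu_neg) / 2          # midpoint, assuming equal class distribution
    calibrated = []
    for v in values:
        z = (v - mu) / sigma                              # Step 2: z score
        calibrated.append(1 / (1 + exp(-d_prime * z)))    # Step 3: sigmoid of d' * z
    return calibrated


values = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]   # made-up feature values
labels = [0, 0, 0, 1, 1, 1]

probs = logistic_calibrate(values, labels)
print([round(p, 3) for p in probs])
```

The calibrated values preserve the ordering of the original feature, and a value at the midpoint between the two class means maps to 0.5.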


Another calibration technique, isotonic calibration, is used on both quantitative and ordinal features.
This uses what is known as an ROC curve (ROC stands for Receiver Operating Characteristic), similar to
the coverage maps used in the discussion of logical models in Chapter 4, Models — Learning from 
Information. The difference is that with an ROC curve, we normalize the axes to [0,1].


We can use the sklearn package to create an ROC curve: 


import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
# train_test_split lives in sklearn.model_selection (formerly sklearn.cross_validation)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier

X, y = datasets.make_classification(n_samples=100, n_classes=3, n_features=5,
                                    n_informative=3, n_redundant=0, random_state=42)

# Binarize the output
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5)
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True))

# score the test set and compute the ROC curve for the first class
y_score = classifier.fit(X_train, y_train).decision_function(X_test)
fpr, tpr, _ = roc_curve(y_test[:, 0], y_score[:, 0])
roc_auc = auc(fpr, tpr)

plt.figure()
plt.plot(fpr, tpr, label='ROC AUC %0.2f' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="best")
plt.show()


Here is the output observed: 


[Figure: Receiver operating characteristic; legend: ROC AUC 0.70; x-axis: False Positive Rate]





The ROC curve maps the true positive rate against the false positive rate for different threshold
values. In the preceding diagram, the ROC curve is the solid line, and the dashed diagonal is drawn
for reference. Once we have constructed the
ROC curve, we calculate the number of positives, m_i, and the total number of instances, n_i, in each
segment of the convex hull. The following formula is then used to calculate the calibrated feature
values:


$$ \frac{m_i + 1}{m_i + 1 + c(n_i - m_i + 1)} $$


In this formula, c is the prior odds, that is, the ratio of the probability of the positive class over the
probability of the negative class.
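
The formula can be evaluated per segment with a one-line function. In this sketch the function name and the example segment counts are hypothetical; only the formula itself comes from the text.

```python
def calibrated_value(m_i, n_i, c=1.0):
    """Calibrated value for a segment with m_i positives out of n_i instances.

    c is the prior odds of the positive class over the negative class.
    """
    return (m_i + 1) / (m_i + 1 + c * (n_i - m_i + 1))


# hypothetical segments of the ROC convex hull: (positives, total instances)
segments = [(8, 10), (5, 10), (1, 10)]
values = [calibrated_value(m, n) for m, n in segments]
print(values)
```

With equal prior odds (c = 1), a segment that is half positive maps to 0.5, and the added ones keep the estimate away from exactly 0 or 1 in small segments.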


So far in our discussion on feature transformations, we assumed that we know all the values for every
feature. In the real world, this is often not the case. If we are working with probabilistic models, we
can estimate the value of a missing feature by taking a weighted average over all feature values. An
important consideration is that the existence of missing feature values may be correlated with the
target variable. For example, data in an individual's medical history is a reflection of the types of
testing that are performed, and this in turn is related to an assessment on risk factors for certain 


If we are using a tree model, we can randomly choose a missing value, allowing the model to split on
it. This, however, will not work for linear models. In this case, we need to fill in the missing values
through a process of imputation. For classification, we can simply use the statistics of the mean, 
median, and mode over the observed features to impute the missing values. If we want to take feature 
correlation into account, we can construct a predictive model for each incomplete feature to predict 
missing values. 


Since scikit-learn estimators always assume that all values in an array are numeric, missing values,
whether encoded as blanks, NaN, or other placeholders, will generate errors. Also, since we may not
want to discard entire rows or columns, as these may contain valuable information, we need to use an
imputation strategy to complete the dataset. In the following code snippet, we will use the
SimpleImputer class, which replaces the older Imputer class in current versions of scikit-learn:


import numpy as np
from sklearn.impute import SimpleImputer

# replace each NaN with the mean of its column
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
print(imp.fit_transform([[1, 3], [4, np.nan], [5, 6]]))


Here 1s the output: 
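To make the mean strategy concrete, here is a minimal sketch of column-wise mean imputation written with the standard library only. The helper impute_column_means is a hypothetical name used for illustration; it is not part of scikit-learn and it ignores edge cases such as a column with no observed values.

```python
import math


def impute_column_means(rows):
    """Replace each NaN with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if not math.isnan(row[j])]
        means.append(sum(observed) / len(observed))
    return [[means[j] if math.isnan(value) else value
             for j, value in enumerate(row)]
            for row in rows]


data = [[1, 3], [4, float("nan")], [5, 6]]
completed = impute_column_means(data)
print(completed)
```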





Many machine learning algorithms require that features are standardized. This means that they will
work best when the individual features look more or less like normally distributed data with near-zero
mean and unit variance. The easiest way to do this is by subtracting the mean value from each
feature and scaling it by dividing by the standard deviation. This can be achieved by the scale()
function or the StandardScaler class in the sklearn.preprocessing module. Although
these will accept sparse data, they probably should not be used in such situations because
centering sparse data would likely destroy its structure. It is recommended to use the
MaxAbsScaler class or the maxabs_scale() function in these cases. Both scale each feature
individually by its maximum absolute value, without shifting the data, so that the values fall in the
range [-1, 1]. Another specific case is when we have outliers in the data. In these cases, using the
robust_scale() function or the RobustScaler class is recommended.
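The arithmetic behind standardization and maximum-absolute-value scaling can be sketched in a few lines of plain Python. The functions standardize and maxabs_scale_values below are illustrative stand-ins written for this example, not the scikit-learn implementations, and the sample values are made up.

```python
import statistics


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


def maxabs_scale_values(values):
    """Divide by the largest absolute value; zeros stay zero, so sparsity is kept."""
    largest = max(abs(v) for v in values)
    return [v / largest for v in values]


dense_feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # mean 5, standard deviation 2
sparse_feature = [0.0, -3.0, 0.0, 6.0]

z_scores = standardize(dense_feature)
scaled = maxabs_scale_values(sparse_feature)
print(z_scores)
print(scaled)
```

Note that the zero entries of the sparse feature remain zero after scaling, whereas centering would have turned them into nonzero values.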


Often, we may want to add complexity to a model by adding polynomial terms. This can be done
using the PolynomialFeatures class:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(9).reshape(3, 3)
poly = PolynomialFeatures(degree=2)
print(X)
print(poly.fit_transform(X))

We will observe the following output:

[[0 1 2]
 [3 4 5]
 [6 7 8]]
[[ 1  0  1  2  0  0  0  1  2  4]
 [ 1  3  4  5  9 12 15 16 20 25]
 [ 1  6  7  8 36 42 48 49 56 64]]


Principal component analysis


Principal Component Analysis (PCA) is the most common form of dimensionality reduction that we
can apply to features. Consider the example of a dataset consisting of two features that we would like
to convert from two dimensions into one. A natural approach would be to draw a line
of closest fit and project each data point onto this line, as shown in the following graph:

[Figure: two-dimensional data points projected onto a line of closest fit]

PCA attempts to find a surface onto which to project the data by minimizing the distance between the
data points and the line we are projecting onto. For the more general case where we have n
dimensions and we want to reduce this space to k dimensions, we find k vectors u(1), u(2), ..., u(k)
onto which to project the data so as to minimize the projection error. That is, we are trying to find a
k-dimensional surface onto which to project the data.

This looks superficially like linear regression; however, it is different in several important ways. With
linear regression, we are trying to predict the value of some output variable given an input variable. In
PCA, we are not trying to predict an output variable; rather, we are trying to find a subspace onto
which to project our input data. The error distances, as represented in the preceding graph, are not the
vertical distances between the points and the line, as in linear regression, but the closest orthogonal
distances between the points and the line. Thus, the error lines are at an angle to the
axis and form a right angle with our projection line.


An important point is that, in most cases, PCA requires that the features are scaled and mean
normalized, that is, the features have zero mean and a comparable range of values. We can
calculate the mean of feature j over the m samples using the following formula:

$$\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)}$$

Mean normalization is then performed by replacing each $x_j^{(i)}$ with $x_j^{(i)} - \mu_j$.

If the features have scales that are significantly different, we can rescale using the following:

$$\frac{x_j^{(i)} - \mu_j}{\sigma_j}$$

Here, $\sigma_j$ measures the spread of feature j, for example, its standard deviation.

These functions are available in the sklearn.preprocessing module.


The mathematical process of calculating both the lower dimensional vectors and the points on these
vectors where we project our original data involves first calculating the covariance matrix and then
calculating the eigenvectors of this matrix. To calculate these values from first principles is quite a
complicated process. Fortunately, the sklearn package has a module for doing just this:

from sklearn.decomposition import PCA
import numpy as np

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA(n_components=1)
pca.fit(X)
print(pca.transform(X))

We will get the following output:

[[-1.3834058 ]
 [-2.22189802]
 [-3.60530382]
 [ 1.3834058 ]
 [ 2.22189802]
 [ 3.60530382]]
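For a two-feature dataset, the covariance and eigenvector steps are small enough to write out by hand. The sketch below is only an illustration under simplifying assumptions: it uses the closed-form eigenvalue of a symmetric 2 x 2 matrix and assumes the two features are correlated (a nonzero off-diagonal covariance). It is not how scikit-learn computes PCA internally, and the sign of the projection may be flipped relative to a library result.

```python
import math

X = [[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]
n = len(X)

# Mean-normalize each of the two features
means = [sum(row[j] for row in X) / n for j in range(2)]
Xc = [[row[j] - means[j] for j in range(2)] for row in X]

# Entries of the 2 x 2 sample covariance matrix [[a, b], [b, d]]
a = sum(r[0] * r[0] for r in Xc) / (n - 1)
b = sum(r[0] * r[1] for r in Xc) / (n - 1)
d = sum(r[1] * r[1] for r in Xc) / (n - 1)

# Largest eigenvalue of a symmetric 2 x 2 matrix (closed form)
largest_eigenvalue = (a + d) / 2 + math.sqrt(((a - d) / 2) ** 2 + b ** 2)

# A matching eigenvector is (b, eigenvalue - a) when b is nonzero; normalize it
vx, vy = b, largest_eigenvalue - a
length = math.hypot(vx, vy)
u = (vx / length, vy / length)

# Project every mean-normalized point onto the first principal direction
projected = [r[0] * u[0] + r[1] * u[1] for r in Xc]
for value in projected:
    print(round(value, 6))
```

Up to rounding, the projected values agree with the output of the PCA class shown earlier.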
Summary


There is a rich variety of ways in which we can both transform and construct new features to make
our models work more efficiently and give more accurate results. In general, there are no hard and
fast rules for deciding which of the methods to use for a particular model. Much depends on the
feature types (quantitative, ordinal, or categorical) that you are working with. A good first approach
is to normalize and scale the features and, if the model requires it, transform the features to an
appropriate type, as we do through discretization. If the model performs poorly, it may be necessary
to apply further preprocessing such as PCA. In the next chapter, we will look at ways in which we
can combine different types of models, through the use of ensembles, to improve performance and
provide greater predictive power.


Chapter 8. Learning with Ensembles


The motivation for creating machine learning ensembles comes from clear intuitions and is grounded
in a rich theoretical history. Diversity, in many natural and human-made systems, makes them more
resilient to perturbations. Similarly, we have seen that averaging results from a number of
measurements can often result in more stable models that are less susceptible to random
fluctuations, such as outliers or errors in data collection.


In this chapter, we will divide this rather large and diverse space into the following topics: 


Ensemble types 
Bagging 
Random forests 
Boosting 




Ensemble types 


Ensemble techniques can be broadly divided into two types: 


• Averaging method: This is the method in which several estimators are run independently and
  their predictions are averaged. This includes random forests and bagging methods.

• Boosting method: This is the method in which weak learners are built sequentially using
  weighted distributions of the data based on the error rates.


Ensemble methods use multiple models to obtain better performance than any single constituent
model. The aim is not only to build diverse and robust models, but also to work within limitations,
such as processing speed and return times. When working with large datasets and quick response times,
this can be a significant developmental bottleneck. Troubleshooting and diagnostics are an important
aspect of working with all machine learning models, but especially when we are dealing with models
that may take days to run.


The types of machine learning ensembles that can be created are as diverse as the models themselves,
and the main considerations revolve around three things: how we divide our data, how we select the
models, and the methods we use to combine their results. This simplistic statement actually
encompasses a very large and diverse space.


Bagging


Bagging, also called bootstrap aggregating, comes in a few flavors, and these are defined by the way
they draw random subsets from the training data. Most commonly, bagging refers to drawing samples
with replacement. Because the samples are replaced, it is possible for the generated datasets to
contain duplicates. It also means that data points may be excluded from a particular generated dataset,
even if this generated set is the same size as the original. Each of the generated datasets will be
different, and this is a way to create diversity among the models in an ensemble. We can calculate the
probability that a data point is not selected in a sample using the following formula:

$$\left(1 - \frac{1}{n}\right)^{n}$$

Here, n is the number of data points in the original dataset, which is also the size of each bootstrap
sample. Each bootstrap sample results in a different hypothesis. The class is predicted either by
averaging the models or by choosing the class predicted by the majority of models. Consider an
ensemble of linear classifiers. If we use majority voting to determine the predicted class, we create a
piecewise linear classifier boundary. If we transform the votes to probabilities, then we partition the
instance space into segments that can each potentially have a different score.

It should also be mentioned that it is possible, and sometimes desirable, to use random subsets of
features; this is called subspace sampling. Bagging estimators work best with complex models such
as fully developed decision trees because they can help reduce overfitting. They provide a simple,
out-of-the-box way to improve a single model.
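The chance that a particular data point is left out of a bootstrap sample can be checked numerically. The following sketch, which uses only the standard library, evaluates the probability (1 - 1/n)^n for a few dataset sizes and compares it with the fraction of points actually left out of one simulated bootstrap sample. The dataset size of 1000 and the random seed are arbitrary choices for this example.

```python
import math
import random


def prob_not_selected(n):
    """Probability that one given point is absent from a bootstrap sample of size n."""
    return (1 - 1 / n) ** n


for size in (5, 50, 500):
    print(size, prob_not_selected(size))
print("limit 1/e:", math.exp(-1))

# Simulate one bootstrap sample: draw n indices with replacement
rng = random.Random(0)
n = 1000
bootstrap = [rng.randrange(n) for _ in range(n)]
fraction_left_out = (n - len(set(bootstrap))) / n
print("fraction left out in simulation:", fraction_left_out)
```

As n grows, the probability approaches 1/e, so roughly a third of the original data points are absent from any one bootstrap sample.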


Scikit-learn implements the BaggingClassifier and BaggingRegressor objects. Here are some of
their most important parameters:

| Parameter          | Type    | Description                                    | Default |
| bootstrap          | Boolean | These are the samples drawn with replacement.  | True    |
| bootstrap_features | Boolean | These are the features drawn with replacement. | False   |





As an example, the following snippet instantiates a bagging classifier comprising 50 decision tree
classifier base estimators, each built on random subsets of half the features and half the samples:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import datasets

bcls = BaggingClassifier(DecisionTreeClassifier(), max_samples=0.5,
                         max_features=0.5, n_estimators=50)
X, y = datasets.make_blobs(n_samples=8000, centers=2, random_state=0, cluster_std=4)
bcls.fit(X, y)
print(bcls.score(X, y))


Random forests


Tree-based models are particularly well suited to ensembles, primarily because they can be sensitive
to changes in the training data. Tree models can be very effective when used with subspace sampling,
resulting in more diverse models and, since each model in the ensemble is working on only a subset
of the features, reduced training time. An ensemble that builds each tree using a different random
subset of the features is called a random forest.

A random forest partitions an instance space by finding the intersection of the partitions in the
individual trees in the forest. It defines a partition that can be finer, that is, will take in more detail,
than a partition created by any individual tree in the forest. In principle, a random forest can be
mapped back to an individual tree, since each intersection corresponds to combining the branches of
two different trees. The random forest can be thought of as essentially an alternative training
algorithm for tree-based models. Similarly, a bagging ensemble of linear classifiers is able to learn a
complicated decision boundary that would be impossible for a single linear classifier to learn.

The sklearn.ensemble module has two algorithms based on decision trees: random forests and
extremely randomized trees. They both create diverse classifiers by introducing randomness into their
construction, and both include classes for classification and regression. With the
RandomForestClassifier and RandomForestRegressor classes, each tree is built using bootstrap
samples. The split chosen by the model is not the best split among all features, but is chosen from a
random subset of features.


Extra trees


The extra trees method, as with random forests, uses a random subset of features, but instead of
using the most discriminative thresholds, the best of a randomly generated set of thresholds is used.
This acts to reduce variance at the expense of a small increase in bias. The two classes are
ExtraTreesClassifier and ExtraTreesRegressor.

Let's take a look at an example of the random forest classifier and the extra trees classifier. In
this example, we use VotingClassifier to combine different classifiers. The voting classifier can
help balance out an individual model's weakness. In this example, we pass four weights to the
function. These weights determine each individual model's contribution to the overall result. We can
see that the two tree models overfit the training data, but also tend to perform better on the test data.
We can also see that ExtraTreesClassifier achieved slightly better results on the test set
compared to the RandomForestClassifier object. Also, the VotingClassifier object performed better on
the test set than all its constituent classifiers. It is worthwhile running this with different weightings,
as well as on different datasets, and seeing how the performance of each model changes:


# Note: in later scikit-learn releases, train_test_split lives in sklearn.model_selection
from sklearn import cross_validation
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import VotingClassifier
from sklearn import datasets

def vclas(w1, w2, w3, w4):
    X, y = datasets.make_classification(n_features=10, n_informative=4,
                                        n_samples=500, n_clusters_per_class=5)
    Xtrain, Xtest, ytrain, ytest = cross_validation.train_test_split(X, y, test_size=0.4)

    clf1 = LogisticRegression(random_state=123)
    clf2 = GaussianNB()
    clf3 = RandomForestClassifier(n_estimators=10, bootstrap=True, random_state=123)
    clf4 = ExtraTreesClassifier(n_estimators=10, bootstrap=True, random_state=123)
    clfes = [clf1, clf2, clf3, clf4]

    # Soft voting averages the predicted class probabilities using the given weights
    eclf = VotingClassifier(estimators=[('lr', clf1), ('gnb', clf2), ('rft', clf3), ('et', clf4)],
                            voting='soft',
                            weights=[w1, w2, w3, w4])

    # Fit the four base classifiers and the voting ensemble
    for c in (clf1, clf2, clf3, clf4, eclf):
        c.fit(Xtrain, ytrain)

    N = 5
    ind = np.arange(N)
    width = 0.3
    fig, ax = plt.subplots()

    # One pair of bars (training score, test score) per base classifier
    for i, clf in enumerate(clfes):
        print(clf, i)
        p1 = ax.bar(i, clfes[i].score(Xtrain, ytrain), width=width, color="black")
        p2 = ax.bar(i + width, clfes[i].score(Xtest, ytest), width=width, color="grey")
    # Final pair of bars for the voting ensemble
    ax.bar(len(clfes) + width, eclf.score(Xtrain, ytrain), width=width, color="black")
    ax.bar(len(clfes) + width * 2, eclf.score(Xtest, ytest), width=width, color="grey")
    plt.axvline(3.8, color='k', linestyle='dashed')
    ax.set_xticks(ind + width)
    ax.set_xticklabels(['LogisticRegression',
                        'GaussianNB',
                        'RandomForestClassifier',
                        'ExtraTrees',
                        'VotingClassifier'],
                       rotation=40,
                       ha='right')
    plt.title('Training and test score for different classifiers')
    plt.legend([p1[0], p2[0]], ['training', 'test'], loc='lower left')
    plt.show()

vclas(1, 3, 5, 4)


You will observe the following output:

[Figure: bar chart titled "Training and test score for different classifiers", showing the training and test scores of each classifier]

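To see how the weights influence a soft vote, the following sketch averages class probabilities by hand. The function soft_vote and the probability values are hypothetical and exist only for this illustration; they mimic the idea of weighted soft voting rather than reproduce the scikit-learn implementation.

```python
def soft_vote(probabilities, weights):
    """Weighted average of per-model class probabilities.

    probabilities: one list of class probabilities per model
    weights: one non-negative weight per model
    Returns the averaged probabilities and the index of the winning class.
    """
    total = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [sum(w * p[k] for w, p in zip(weights, probabilities)) / total
                for k in range(n_classes)]
    winner = max(range(n_classes), key=lambda k: averaged[k])
    return averaged, winner


# Made-up probabilities for one instance from four models, two classes each
model_probabilities = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7], [0.4, 0.6]]

equal_avg, equal_class = soft_vote(model_probabilities, [1, 1, 1, 1])
weighted_avg, weighted_class = soft_vote(model_probabilities, [1, 3, 5, 4])
print(equal_avg, equal_class)
print(weighted_avg, weighted_class)
```

With equal weights the first class wins, but giving more weight to the last two models changes the ensemble's decision to the second class.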

Tree models allow us to assess the relative rank of features in terms of the expected fraction of
samples they contribute to. Here, we use one to evaluate the importance of each feature in a
classification task. A feature's relative importance is based on where it is represented in the tree.
Features at the top of a tree contribute to the final decision of a larger proportion of input samples.


The following example uses an ExtraTreesClassifier class to map feature importance. The
dataset we are using consists of 10 images of each of 40 people, which is 400 images in total. Each
image has a label indicating the person's identity. In this task, each pixel is a feature; in the output, the
pixel's brightness represents the feature's relative importance. The brighter the pixel, the more
important the feature. Note that in this model, the brightest pixels are in the forehead region, and we
should be careful how we interpret this. Since most photographs are illuminated from above the head,
the apparently high importance of these pixels may be due to the fact that foreheads tend to be better
illuminated, and therefore reveal more detail about an individual, rather than the intrinsic properties
of a person's forehead in indicating their identity:


import matplotlib.pyplot as plt
from sklearn.datasets import fetch_olivetti_faces
from sklearn.ensemble import ExtraTreesClassifier

data = fetch_olivetti_faces()

def importance(n_estimators=500, max_features=128, n_jobs=3, random_state=0):
    # Flatten each image so that every pixel becomes one feature
    X = data.images.reshape((len(data.images), -1))
    y = data.target
    forest = ExtraTreesClassifier(n_estimators=n_estimators, max_features=max_features,
                                  n_jobs=n_jobs, random_state=random_state)
    forest.fit(X, y)

    dstring = (" cores=%d..." % n_jobs + " features=%s..." % max_features
               + "estimators=%d..." % n_estimators + "random=%d" % random_state)
    print(dstring)

    # Reshape the per-pixel importances back into the image dimensions
    importances = forest.feature_importances_
    importances = importances.reshape(data.images[0].shape)

    plt.matshow(importances, cmap=plt.cm.hot)
    plt.title(dstring)
    #plt.savefig('etreesImportance' + dstring + '.png')
    plt.show()

importance()


The output for the preceding code is as follows:

[Figure: heat map of pixel importances, titled "cores=3... features=128...estimators=500...random=0"]


Boosting


Earlier in this book, I introduced the idea of the PAC learning model and the idea of concept classes.
A related idea is that of weak learnability. Here, each of the learning algorithms in the ensemble need
only perform slightly better than chance. For example, if each algorithm in the ensemble is correct at
least 51% of the time, then the criteria of weak learnability are satisfied. It turns out that the ideas of
PAC and weak learnability are essentially the same, except that for the latter, we drop the requirement
that the algorithm must achieve arbitrarily high accuracy. Instead, it need only perform better than a
random hypothesis. How is this useful, you may ask? It is often easier to find rough rules of thumb
than a highly accurate prediction rule. This weak learning model may only perform slightly
better than chance; however, if we boost this learner by running it many times on different weighted
distributions of the data and by combining these learners, we can, hopefully, build a single prediction
rule that performs much better than any of the individual weak learning rules.

Boosting is a simple and powerful idea. It extends bagging by taking into account a model's training
error. For example, suppose we train a linear classifier and find that it misclassifies a certain set of
instances. If we train a subsequent model on a dataset containing duplicates of these misclassified
instances, then we would expect that this newly trained model would perform better on a test set. By
including duplicates of misclassified instances in the training set, we are shifting the mean of the
dataset towards these instances. This forces the learner to focus on the most difficult-to-classify
examples. This is achieved in practice by giving misclassified instances higher weight and then
modifying the model to take this into account; for example, in a linear classifier, we can calculate the
class means by using weighted averages.

Starting from a dataset of uniform weights that sum to one, we run the classifier and will likely
misclassify some instances. To boost the weight of these instances, we assign them half the total
weight. For example, consider a classifier that gives us the following results:


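[Table: confusion matrix of actual versus predicted classes (columns: Predicted positive, Predicted negative) for 100 instances; the two misclassified cells contain 9 and 16 instances]

The error rate is e = (9 + 16)/100 = 0.25.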





We want to assign half the total weight to the misclassified samples, and since we started with
uniform weights that sum to 1, the current weight assigned to the misclassified examples is simply the
error rate. To update the weights, therefore, we multiply them by the factor 1/(2e). Assuming that the
error rate is less than 0.5, this results in an increase in the weights of the misclassified examples. To
ensure that the weights still sum to 1, we multiply the weights of the correctly classified examples by
1/(2(1 - e)). In this example, the error rate, which is also the initial total weight of the incorrectly
classified samples, is 0.25 and we want it to be 0.5, that is, half the total weight, so we multiply the
weights of the misclassified samples by 2. The factor for the correctly classified instances is
1/(2(1 - e)) = 2/3. Taking these weights into account results in the following table:

[Table: the reweighted confusion matrix (columns: Predicted positive, Predicted negative)]
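The reweighting arithmetic for this example can be verified with a few lines of code. The sketch below only uses the figures quoted in the text (9 + 16 misclassified instances out of 100); the function name boost_factors is a hypothetical helper introduced for illustration.

```python
def boost_factors(error_rate):
    """Return the multipliers for misclassified and correctly classified weights."""
    return 1 / (2 * error_rate), 1 / (2 * (1 - error_rate))


n_total = 100
n_misclassified = 9 + 16
e = n_misclassified / n_total            # error rate, 0.25

factor_wrong, factor_right = boost_factors(e)

# Total weight held by each group after the update
weight_wrong = e * factor_wrong
weight_right = (1 - e) * factor_right

print(factor_wrong, factor_right)
print(weight_wrong, weight_right)
```

After the update, the misclassified and the correctly classified instances each hold half of the total weight, so the weights still sum to 1.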


The final piece we need is a confidence factor, $\alpha_t$, which is applied to each model in the
ensemble. This is used to make an ensemble prediction based on the weighted averages from each
individual model. We want this to increase with decreasing errors. A common way to ensure this
happens is to set the confidence factor to the following:

$$\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$$

Here, $\epsilon_t$ is the weighted error rate of the model built at iteration t.

So we are given a dataset, such as the following:

$$(x_1, y_1), \ldots, (x_m, y_m) \quad \text{where } x_i \in X,\; y_i \in Y = \{-1, +1\}$$

We then initialize an equal weighted distribution, such as the following:

$$W_1(i) = \frac{1}{m}$$

Using a weak classifier, $h_t$, we can write an update rule as follows:

$$W_{t+1}(i) = \frac{W_t(i)\,\exp\left(-\alpha_t y_i h_t(x_i)\right)}{Z_t}$$

With the normalization factor, such as the following:

$$Z_t = \sum_{i=1}^{m} W_t(i)\,\exp\left(-\alpha_t y_i h_t(x_i)\right)$$

Note that $\exp(-y_i h_t(x_i))$ is positive and greater than 1 if $-y_i h_t(x_i)$ is positive, and this
happens if $x_i$ is misclassified. The result is that the update rule will increase the weight of a
misclassified example and decrease the weight of correctly classified samples.
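One round of this exponential weight update can be sketched as follows. The sketch assumes the commonly used AdaBoost confidence factor, alpha = 0.5 * ln((1 - error) / error), labels in {-1, +1}, and a toy set of four examples of which one is misclassified; the function name adaboost_round and the data are made up for illustration.

```python
import math


def adaboost_round(weights, labels, predictions):
    """Perform one boosting round and return (alpha, updated weights)."""
    # Weighted error: total weight of the misclassified examples
    error = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    alpha = 0.5 * math.log((1 - error) / error)
    # Exponential update: y * h is +1 when correct and -1 when misclassified
    unnormalized = [w * math.exp(-alpha * y * h)
                    for w, y, h in zip(weights, labels, predictions)]
    z = sum(unnormalized)                 # normalization factor
    return alpha, [u / z for u in unnormalized]


labels = [1, 1, -1, -1]
predictions = [1, -1, -1, -1]             # the second example is misclassified
weights = [0.25, 0.25, 0.25, 0.25]        # uniform initial distribution

alpha, new_weights = adaboost_round(weights, labels, predictions)
print(alpha)
print(new_weights)
```

With an error rate of 0.25, the misclassified example ends up with half of the total weight, which matches multiplying its weight by 1/(2e) = 2, and each correctly classified example is scaled by 1/(2(1 - e)) = 2/3.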


We can write the final classifier as follows:

$$H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$$


AdaBoost


One of the most popular boosting algorithms is called AdaBoost, or adaptive boosting. Here, a
decision tree classifier is used as the base learner, and it builds a decision boundary on data that is
not linearly separable:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_blobs

plot_colors = "br"
plot_step = 0.02
class_names = "AB"
tree = DecisionTreeClassifier()
boost = AdaBoostClassifier()
X, y = make_blobs(n_samples=500, centers=2, random_state=0, cluster_std=2)
boost.fit(X, y)
plt.figure(figsize=(10, 5))

# Plot the decision boundaries
plt.subplot(121)
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                     np.arange(y_min, y_max, plot_step))

Z = boost.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
plt.axis("tight")

# Plot the training points of each class
for i, n, c in zip(range(2), class_names, plot_colors):
    idx = np.where(y == i)
    plt.scatter(X[idx, 0], X[idx, 1],
                c=c, cmap=plt.cm.Paired,
                label="Class %s" % n)
plt.title('Decision Boundary')

# Plot a histogram of the decision scores for each class
twoclass_output = boost.decision_function(X)
plot_range = (twoclass_output.min(), twoclass_output.max())
plt.subplot(122)
for i, n, c in zip(range(2), class_names, plot_colors):
    plt.hist(twoclass_output[y == i],
             bins=20,
             range=plot_range,
             facecolor=c,
             label='Class %s' % n)
x1, x2, y1, y2 = plt.axis()
plt.axis((x1, x2, y1, y2))
plt.legend(loc='upper left')
plt.ylabel('Samples')
plt.xlabel('Score')
plt.title('Decision Scores')
plt.show()
print("Mean Accuracy = %.3f" % boost.score(X, y))

The following is the output of the preceding commands:

[Figure: two panels titled "Decision Boundary" and "Decision Scores"]

Mean Accuracy = 0.900


Gradient boosting


Gradient tree boosting is a very useful algorithm for both regression and classification problems. One
of its major advantages is that it naturally handles mixed data types, and it is also quite robust to
outliers. Additionally, it has better predictive power than many other algorithms; however, its
sequential architecture makes it unsuitable for parallel techniques, and therefore, it does not scale
well to large datasets. For datasets with a large number of classes, it is recommended to use
RandomForestClassifier instead. Gradient boosting typically uses decision trees to build a
prediction model based on an ensemble of weak learners, applying an optimization algorithm on the
cost function.


In the following example, we create a function that builds a gradient boosting classifier and graphs its
cumulative improvement in loss versus the number of iterations. The GradientBoostingClassifier class
has an oob_improvement_ attribute, which is used here to calculate an estimate of the test loss on each
iteration. This gives us the reduction in the loss compared to the previous iteration. This can be a very
useful heuristic for determining the optimum number of iterations. Here, we plot the cumulative
improvement of two gradient boosting classifiers. The classifiers are identical except for the
learning rate: 0.01 in the case of the dotted line and 0.001 for the solid line.


The learning rate shrinks the contribution of each tree, and this means that there is a tradeoff with the
number of estimators. Here, we actually see that with a larger learning rate, the model appears to
reach its optimum performance faster than the model with a lower learning rate. However, the model
with the lower learning rate appears to achieve better results overall. What usually occurs in practice
is that oob_improvement_ deviates in a pessimistic way over a large number of iterations. Let's take a
look at the following commands:


import numpy as np
import matplotlib.pyplot as plt
from sklearn import ensemble
# Note: in later scikit-learn releases, train_test_split lives in sklearn.model_selection
from sklearn.cross_validation import train_test_split
from sklearn import datasets

def gbt(params, X, y, ls):
    # Fit on the data passed in and plot the cumulative out-of-bag improvement
    clf = ensemble.GradientBoostingClassifier(**params)
    clf.fit(X, y)
    cumsum = np.cumsum(clf.oob_improvement_)
    n = np.arange(params['n_estimators'])
    # Iteration at which the cumulative improvement is largest
    oob_best_iter = n[np.argmax(cumsum)]
    plt.xlabel('Iterations')
    plt.ylabel('Improvement')
    plt.axvline(x=oob_best_iter, linestyle=ls)
    plt.plot(n, cumsum, linestyle=ls)

X, y = datasets.make_blobs(n_samples=50, centers=5, random_state=0, cluster_std=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    random_state=9)

p1 = {'n_estimators': 1200, 'max_depth': 3, 'subsample': 0.5,
      'learning_rate': 0.01, 'min_samples_leaf': 1, 'random_state': 3}
p2 = {'n_estimators': 1200, 'max_depth': 3, 'subsample': 0.5,
      'learning_rate': 0.001, 'min_samples_leaf': 1, 'random_state': 3}
gbt(p1, X_train, y_train, ls='--')
gbt(p2, X_train, y_train, ls='-')
plt.show()


You will observe the following output:

[Figure: cumulative out-of-bag improvement plotted against the number of iterations for the two learning rates]


Ensemble strategies


We looked at two broad ensemble techniques: bagging, as applied in random forests and extra trees, and
boosting, in particular AdaBoost and gradient tree boosting. There are of course many other variants
and combinations of these. In the last section of this chapter, I want to examine some strategies for
choosing and applying different ensembles to particular tasks.


Generally, in classification tasks, there are three reasons why a model may misclassify a test instance.
Firstly, it may simply be unavoidable if instances from different classes are described by the same
feature vectors. In probabilistic models, this happens when the class distributions overlap so that an
instance has non-zero likelihoods for several classes. Here, we can only approximate a target
hypothesis.


The second reason for classification errors is that the model does not have the expressive capabilities
to fully represent the target hypothesis. For example, even the best linear classifier will misclassify
instances if the data is not linearly separable. This is due to the bias of the classifier. Although there
is no single agreed way to measure bias, we can see that a nonlinear decision boundary will have less
bias than a linear one, or that more complex decision boundaries will have less bias than simpler
ones. We can also see that tree models have the least bias because they can continue to branch until
only a single instance is covered by each leaf.


Now, it may seem that we should attempt to minimize bias; however, in most cases, lowering the bias
tends to increase the variance and vice versa. Variance, as you have probably guessed, is the third
source of classification errors. High variance models are highly dependent on training data. The
nearest neighbor classifier, for example, segments the instance space around individual training
points. If a training point near the decision boundary is moved, then that boundary will change. Tree
models are also high variance, but for a different reason. Consider that we change the training data in
such a way that a different feature is selected at the root of the tree. This will likely result in the rest
of the tree being different.


A bagged ensemble of linear classifiers is able to learn a more complicated decision boundary
through piecewise construction. Each classifier in the ensemble creates a segment of the decision
boundary. This shows that bagging, indeed any ensemble method, is capable of reducing the bias of
high bias models. However, what we find in practice is that boosting is generally a more effective
way of reducing bias.


Note 


Bagging is primarily a variance reduction technique and boosting is primarily a bias reduction 
technique. 


Bagging ensembles work most effectively with high variance models, such as complex trees, whereas
boosting is typically used with high bias models such as linear classifiers.

We can look at boosting in terms of the margin. This can be understood as being the signed distance
from the decision boundary; a positive sign indicates the correct class and a negative sign a false one.
What can be shown is that boosting can increase this margin, even when samples are already on the
correct side of the decision boundary. In other words, boosting can continue to improve performance
on the test set even when the training error is zero.


Other methods


The major variations on ensemble methods are achieved by changing the way predictions of the base
models are combined. We can actually define this as a learning problem in itself: given the
predictions of a set of base classifiers as features, learn a meta-model that best combines their
predictions. Learning a linear meta-model is known as stacking or stacked generalization. Stacking
uses a weighted combination of all learners and, in a classification task, a combiner algorithm such as
logistic regression is used to make the final prediction. Unlike bagging or boosting, and like
bucketing, stacking is often used with models of different types.


Typical stacking routines involve the following steps: 


1. Split the training set into two disjoint sets.

2. Train several base learners on the first set.

3. Test the base learners on the second set.

4. Use the predictions from the previous step to train a higher level learner.


Note that the first three steps are identical to cross validation; however, rather than taking a 
winner-takes-all approach, the base learners are combined, possibly nonlinearly. 
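

The four steps above can be sketched as follows. This is only a minimal illustration, assuming a 
recent scikit-learn release in which train_test_split lives in sklearn.model_selection; the 
synthetic dataset, the choice of a decision tree and a k-nearest neighbors classifier as base 
learners, and the helper meta_features are hypothetical choices made for the example. An extra test 
set is held out first so that the stacked model can be scored on data that neither level has seen: 

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_informative=5, random_state=0)
# Hold out a final test set that neither the base learners nor the meta-model sees
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Step 1: split the remaining training data into two disjoint sets
X_a, X_b, y_a, y_b = train_test_split(X_dev, y_dev, test_size=0.5, random_state=0)

# Step 2: train several base learners on the first set
base_learners = [DecisionTreeClassifier(max_depth=3, random_state=0),
                 KNeighborsClassifier(n_neighbors=5)]
for model in base_learners:
    model.fit(X_a, y_a)


def meta_features(models, data):
    """One column per base learner: its predicted probability of class 1."""
    return np.column_stack([m.predict_proba(data)[:, 1] for m in models])


# Steps 3 and 4: predict on the second set, then train the higher level learner
meta_model = LogisticRegression().fit(meta_features(base_learners, X_b), y_b)
stacked_accuracy = meta_model.score(meta_features(base_learners, X_test), y_test)
print("stacked accuracy: %.3f" % stacked_accuracy)
```

Using predicted probabilities rather than hard class labels as meta-features is one possible design 
choice; it gives the logistic regression combiner more information about how confident each base 
learner is. 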


A variation on this theme is bucketing. Here, a selection algorithm is used to choose the best model 
for each problem. This can be done, for example, using a perceptron to pick the best model by giving 
a weight to the predictions of each model. With a large set of diverse models, some will take longer 
to train than others. A way to use this in an ensemble is to first use the fast but imprecise algorithms to 
choose which slower, but more accurate, algorithms will likely do best. 


We can incorporate diversity using a heterogeneous set of base learners. This diversity comes from 
the different learning algorithms and not the data. This means that each model can use the same 
training set. Often, the base models consist of sets of the same type but with different hyperparameter 
settings. 


Ensembles, in general, consist of a set of base models and a meta-model that is trained to find the 
best way to combine these base models. If we are using a weighted set of models and combining their 
output in some way, we assume that if a model has a weight close to zero, then it will have very little 
influence on the output. It is conceivable that a base classifier has a negative weight, and in this case, 
its prediction would be inverted, relative to the other base models. We can even go further and 
attempt to predict how well a base model is likely to perform even before we train it. This is 
sometimes called meta-learning. This involves, first, training a variety of models on a large 
collection of data and constructing a model that will help us answer questions such as which model is 
likely to outperform another model on a particular dataset, or whether the data indicates that particular 
(meta) parameters are likely to work best. 


Remember that no learning algorithm can outperform another when evaluated over the space of all 
possible problems, such as predicting the next number in a sequence if all possible sequences are 
equally likely. Of course, learning problems in the real world have nonuniform distributions, and this 
enables us to build prediction models on them. The important question in meta-learning is how to 
design the features on which the meta-model is built. They need to combine the relevant 
characteristics of both the trained model and the dataset. This must include aspects of the data 
beyond the number and type of features, and the number of samples. 


Summary 


In this chapter, we looked at the major ensemble methods and their implementations in scikit-learn. It 
is clear that there is a large space to work in and finding what techniques work best for different types 
of problems is the key challenge. We saw that the problems of bias and variance each have their own 
solution, and it is essential to understand the key indicators of each of these. Achieving good results 
usually involves much experimentation, and using some of the simple techniques described in this 
chapter, you can begin your journey into machine learning ensembles. 


In the next and last chapter, we will introduce the most important topic, model selection and 
evaluation, and examine some real-world problems from different perspectives. 


Chapter 9. Design Strategies and Case Studies 


With the possible exception of data munging, evaluating is probably what machine learning scientists 
spend most of their time doing: staring at lists of numbers and graphs, watching hopefully as their 
models run, and trying earnestly to make sense of their output. Evaluation is a cyclical process; we 
run models, evaluate the results, and plug in new parameters, each time hoping that this will result in 
a performance gain. Our work becomes more enjoyable and productive as we increase the efficiency 
of each evaluation cycle, and there are some tools and techniques that can help us achieve this. This 
chapter will introduce some of these through the following topics: 


• Evaluating model performance 
• Model selection 
• Real-world case studies 
• Machine learning design at a glance 


Evaluating model performance 


Measuring a model's performance is an important machine learning task, and there are many varied 
metrics and heuristics for doing this. The importance of defining a scoring strategy should not be 
underestimated, and in Sklearn, there are basically three approaches: 


• Estimator score: This refers to using the estimator's inbuilt score() method, specific to each 
estimator 

• Scoring parameters: This refers to cross-validation tools relying on an internal scoring strategy 

• Metric functions: These are implemented in the metrics module 


We have seen examples of the estimator score() method, for example, clf.score(). In the case of a 
linear classifier, the score() method returns the mean accuracy. It is a quick and easy way to gauge 
an individual estimator's performance. However, this method is usually insufficient in itself for a 
number of reasons. 


If we remember, accuracy is the sum of the true positive and true negative cases divided by the 
number of samples. Using this as a measure would indicate that if we performed a test on a number of 
patients to see if they had a particular disease, simply predicting that every patient was disease free 
would likely give us a high accuracy. Obviously, this is not what we want. 


A better way to measure performance is by using precision (P) and recall (R). If you remember 
from the table in Chapter 4, Models - Learning from Information, precision, or positive predictive 
value, is the proportion of predicted positive instances that are correct, that is, TP/(TP + FP). 
Recall, or sensitivity, is TP/(TP + FN). The F-measure is defined as 2*R*P/(R + P). These measures 
ignore the true negatives, and so they do not evaluate how well a model handles negative cases. 
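

To make the difference between accuracy and these measures concrete, here is a small sketch that 
computes all four quantities directly from the counts of true and false positives and negatives. The 
function name classification_metrics and the patient counts are hypothetical, chosen only to mirror 
the disease screening example above: 

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, F-measure) from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / float(total)
    # Guard against empty denominators, for example when nothing is predicted positive
    precision = tp / float(tp + fp) if (tp + fp) else 0.0
    recall = tp / float(tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * recall * precision / (recall + precision)
    else:
        f_measure = 0.0
    return accuracy, precision, recall, f_measure


# 1,000 hypothetical patients, 10 of whom have the disease.
# Predicting "disease free" for everyone: high accuracy, zero recall.
always_healthy = classification_metrics(tp=0, fp=0, fn=10, tn=990)

# A hypothetical screening model that finds 8 of the 10 cases with 20 false alarms.
screening = classification_metrics(tp=8, fp=20, fn=2, tn=970)

print("always healthy: acc=%.3f P=%.3f R=%.3f F=%.3f" % always_healthy)
print("screening:      acc=%.3f P=%.3f R=%.3f F=%.3f" % screening)
```

The trivial predictor wins on accuracy (0.99 against 0.978) yet has zero precision, recall, and 
F-measure, which is exactly why accuracy alone is a poor guide when classes are unbalanced. 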


Rather than use the score method of the estimator, it often makes sense to use specific scoring 
parameters such as those provided by the cross_val_score function. This has a cv parameter that 
controls how the data is split. It is usually set as an int, and it determines how many 
consecutive folds the data is split into. Each of these has a different split point. This parameter can 
also be set to an iterable of train and test splits, or an object that can be used as a cross validation 
generator. 


Also important in cross_val_score is the scoring parameter. This is usually set by a string 
indicating a scoring strategy. For classification, the default is accuracy, and some common values are 
f1, precision, recall, as well as the micro-averaged, macro-averaged, and weighted versions of 
these. For regression estimators, the scoring values are mean_absolute_error, mean_squared_error, 
median_absolute_error, and r2. 


The following code estimates the performance of three models on a dataset using 10 consecutive 
splits. Here, we print out the mean of each score, using several measures, for each of the three models. 
In a real-world situation, we will probably need to preprocess our data in one or more ways, and it is 
important to apply these data transformations to the test set as well as the training set. To make this 
easier, we can use the sklearn.pipeline module. This sequentially applies a list of transforms and 
a final estimator, and it allows us to assemble several steps that can be cross-validated together. 
Here, we also use the StandardScaler() class to scale the data. Scaling is applied to the decision 
tree by wrapping it in a pipeline; the other models can be wrapped in the same way: 


from sklearn import cross_validation
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import samples_generator
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import cross_val_score
from sklearn.pipeline import Pipeline

X, y = samples_generator.make_classification(n_samples=1000, n_informative=5,
                                             n_redundant=0, random_state=42)
le = LabelEncoder()
y = le.fit_transform(y)
Xtrain, Xtest, ytrain, ytest = cross_validation.train_test_split(
    X, y, test_size=0.5, random_state=1)
clf1 = DecisionTreeClassifier(max_depth=2, criterion='gini').fit(Xtrain, ytrain)
clf2 = svm.SVC(kernel='linear', probability=True,
               random_state=0).fit(Xtrain, ytrain)
clf3 = LogisticRegression(penalty='l2', C=0.001).fit(Xtrain, ytrain)
# Wrap the decision tree in a pipeline so that scaling is refit on each training fold
pipe1 = Pipeline([['sc', StandardScaler()], ['mod', clf1]])
mod_labels = ['Decision Tree', 'SVM', 'Logistic Regression']
print('10 fold cross validation: \n')
for mod, label in zip([pipe1, clf2, clf3], mod_labels):
    # Score the same model under four different scoring strategies
    auc_scores = cross_val_score(estimator=mod, X=Xtrain, y=ytrain, cv=10,
                                 scoring='roc_auc')
    p_scores = cross_val_score(estimator=mod, X=Xtrain, y=ytrain, cv=10,
                               scoring='precision_macro')
    r_scores = cross_val_score(estimator=mod, X=Xtrain, y=ytrain, cv=10,
                               scoring='recall_macro')
    f_scores = cross_val_score(estimator=mod, X=Xtrain, y=ytrain, cv=10,
                               scoring='f1_macro')
    print(label)
    print("auc scores %2f +/- %2f " % (auc_scores.mean(), auc_scores.std()))
    print("precision %2f +/- %2f " % (p_scores.mean(), p_scores.std()))
    print("recall %2f +/- %2f " % (r_scores.mean(), r_scores.std()))
    print("f scores %2f +/- %2f " % (f_scores.mean(), f_scores.std()))


On execution, you will see the following output: 


10 fold cross validation: 

Decision Tree 
auc scores 0.697144 +/- 0.056665 
precision 0.706912 +/- 0.065688 
recall 0.648131 +/- 0.043604 
f scores 0.678455 +/- 0.051711 
SVM 
auc scores 0.768374 +/- 0.036460 
precision 0.769994 +/- 0.058011 
recall 0.707064 +/- 0.056323 
f scores 0.703605 +/- 0.055579 
Logistic Regression 
auc scores 0.754150 +/- 0.046137 
precision 0.688979 +/- 0.077614 
recall 0.686077 +/- 0.07605? 
f scores 0.687859 +/- 0.075356 





There are several variations on these techniques, most commonly using what is known as k-fold cross 
validation. This uses what is sometimes referred to as the leave one out strategy, with one fold left 
out at a time. First, the model is trained using k-1 of the folds as training data. The remaining 
fold is then used to compute the performance measure. This is repeated for each of the folds. The 
performance is calculated as an average over all the folds. 
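

The mechanics of k-fold splitting are simple enough to write out by hand. The following sketch is 
not how scikit-learn implements it; k_fold_indices and cross_val_mean are hypothetical helper names, 
and fit_and_score stands for any user-supplied function that trains on the train indices and returns 
a score on the test indices: 

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train, test) index lists for consecutive, unshuffled folds."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds take one additional sample each
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_val_mean(fit_and_score, n_samples, n_folds):
    """Average the score returned for each (train, test) split."""
    scores = [fit_and_score(train, test)
              for train, test in k_fold_indices(n_samples, n_folds)]
    return sum(scores) / float(len(scores))


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train=%s test=%s" % (train, test))
```

Every sample appears in exactly one test fold, and each model is trained on the remaining k-1 folds. 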


Sklearn implements this using the cross_validation.KFold object. The important parameters are a 
required int, indicating the total number of elements, and an n_folds parameter, defaulting to 3, to 
indicate the number of folds. It also takes optional shuffle and random_state parameters indicating 
whether to shuffle the data before splitting, and how to seed the random number generator used for 
shuffling. By default, random_state uses the NumPy random number generator. 
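

Note that in recent scikit-learn releases the cross_validation module has been replaced by 
sklearn.model_selection, where KFold takes the number of folds as n_splits and receives the data 
through its split() method instead of a sample count. A minimal sketch under that assumption (the 
toy array X is hypothetical): 

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # ten toy samples with two features each
k_fold = KFold(n_splits=5, shuffle=True, random_state=0)

# split() yields arrays of train and test indices for each fold
fold_sizes = [(len(train), len(test)) for train, test in k_fold.split(X)]
print(fold_sizes)
```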


In the following snippet, we use the LassoCV object. This is a linear model trained with L1 
regularization. The optimization function for regularized linear regression, if you remember, includes 
a constant (alpha) that multiplies the L1 regularization term. The LassoCV object automatically sets 
this alpha value, and to see how effective this is, we can compare the selected alpha and the score on 
each of the k-folds: 


import numpy as np
from sklearn import cross_validation, datasets, linear_model

X, y = datasets.make_blobs(n_samples=80, centers=2, random_state=0, cluster_std=2)
alphas = np.logspace(-4, -.5, 30)
lasso_cv = linear_model.LassoCV(alphas=alphas)
k_fold = cross_validation.KFold(len(X), 5)

# Refit LassoCV on each training fold and score it on the held-out fold
for k, (train, test) in enumerate(k_fold):
    lasso_cv.fit(X[train], y[train])
    print("[fold {0}] alpha: {1:.5f}, score: {2:.5f}".
          format(k, lasso_cv.alpha_, lasso_cv.score(X[test], y[test])))


The output of the preceding commands is as follows: 


[Output: one line per fold, of the form "[fold k] alpha: ..., score: ...", for each of the five folds] 





Sometimes, it is necessary to preserve the percentages of the classes in each fold. This is done using 
stratified cross validation. It can be helpful when classes are unbalanced, that is, when there is a 
larger number of some classes and very few of others. Using the StratifiedKFold object may help correct 
defects in models that might cause bias because a class is not represented in a fold in large enough 
numbers to make an accurate prediction. However, this may also cause an unwanted increase in 
variance. 
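

A quick way to see what stratification does is to count the minority class in each test fold. This 
sketch assumes a recent scikit-learn release, where StratifiedKFold lives in sklearn.model_selection 
and receives the labels through split(); the 90/10 label vector is a hypothetical example: 

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 100 toy samples: 90 of class 0 and 10 of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # the features play no role in how the folds are formed

skf = StratifiedKFold(n_splits=5)
minority_per_fold = [int(y[test].sum()) for _, test in skf.split(X, y)]
print(minority_per_fold)
```

Each of the five test folds receives two of the ten minority samples, preserving the 10 percent 
class ratio. With plain consecutive folds on this ordering, the first four folds would contain no 
minority samples at all. 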


In the following example, we use stratified cross validation to test how significant the classification 
score is. This is done by repeating the classification procedure after randomizing the labels. The p 
value is the fraction of runs for which the score on the randomized labels is at least as large as the 
classification score obtained initially. This code snippet uses the 
cross_validation.permutation_test_score function that takes the estimator, data, and labels as 
parameters. Here, we print out the initial test score, the p value, and the score on each permutation: 


import numpy as np
from sklearn import linear_model
from sklearn.cross_validation import StratifiedKFold, permutation_test_score
from sklearn import datasets

X, y = datasets.make_classification(n_samples=100, n_features=5)
n_classes = np.unique(y).size
cls = linear_model.LogisticRegression()
cv = StratifiedKFold(y, 2)
score, permutation_scores, pvalue = permutation_test_score(
    cls, X, y, scoring="f1", cv=cv, n_permutations=10, n_jobs=1)

print("Classification score %s (pvalue : %s)" % (score, pvalue))
print("Permutation scores %s" % (permutation_scores))


This gives the following output: 


Classification score 0.968962585034 (pvalue : 0.0909090909091) 
Permutation scores [ 0.36310273  0.57189542  0.55977011  0.38134058  0.50807139  0.47916667 
  0.47153537  0.35797519  0.46071429  0.49      ] 
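

The p value can be reproduced by hand from the permutation scores. The sketch below uses the usual 
permutation test estimate, in which the original run is counted as one of the runs; the function name 
permutation_p_value and the rounded scores are illustrative assumptions rather than library output: 

```python
def permutation_p_value(score, permutation_scores):
    """Fraction of runs, counting the original one, that score at least as well."""
    at_least_as_good = sum(1 for s in permutation_scores if s >= score)
    return (at_least_as_good + 1.0) / (len(permutation_scores) + 1.0)


# Ten hypothetical permutation scores, none of which reaches the original score
shuffled_scores = [0.36, 0.57, 0.56, 0.38, 0.51, 0.48, 0.47, 0.36, 0.46, 0.49]
p = permutation_p_value(0.97, shuffled_scores)
print("p value: %.4f" % p)
```

With 10 permutations, the smallest p value that can be obtained is 1/11, so a larger 
n_permutations is needed before a small p value can be reported. 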
Model selection 


There are a number of hyperparameters that can be adjusted to improve performance. Determining the 
effect of the various parameters, both individually and in combination with each other, is often not 
a straightforward process. Common things to try include getting more training examples, adding or 
removing features, adding polynomial features, and increasing or decreasing the regularization 
parameter. Given that we can spend a considerable amount of time collecting more data, or 
manipulating data in other ways, it is important that the time you spend is likely to result in a 
productive outcome. One of the most important ways to do this is using a process known as grid 
search. 


Gridsearch 


The sklearn.grid_search.GridSearchCV object is used to perform an exhaustive search on 
specified parameter values. This allows iteration through defined sets of parameters and the reporting 
of the result in the form of various metrics. The important parameters for GridSearchCV objects are 
an estimator and a parameter grid. The param_grid parameter is a dictionary, or list of dictionaries, 
with parameter names as keys and a list of parameter settings to try as values. This enables searching 
over any sequence of the estimator's parameter values. Any of an estimator's adjustable parameters 
can be used with grid search. By default, grid search uses the score() function of the estimator to 
evaluate a parameter value. For classification, this is the accuracy, and as we have seen, this may not 
be the best measure. In this example, we set the scoring parameter of the GridSearchCV object to the 
weighted versions of precision, recall, and f1 in turn. 


In the following code, we perform a search over a range of C values (the inverse regularization 
parameter), under both L1 and L2 regularization. We use the metrics.classification_report 
function to print out a detailed classification report: 


from sklearn import datasets
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression as lr

X, y = datasets.make_blobs(n_samples=800, centers=2, random_state=0, cluster_std=4)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)
tuned_parameters = [{'penalty': ['l1'],
                     'C': [0.01, 0.1, 1, 5]},
                    {'penalty': ['l2'], 'C': [0.01, 0.1, 1, 5]}]
scores = ['precision', 'recall', 'f1']
for score in scores:
    clf = GridSearchCV(lr(C=1), tuned_parameters, cv=5,
                       scoring='%s_weighted' % score)
    clf.fit(X_train, y_train)
    print("Best parameters on development set:")
    print()
    print(clf.best_params_)
    print("Grid scores on development set:")
    # cv_scores holds the per-fold scores for one parameter setting
    for params, mean_score, cv_scores in clf.grid_scores_:
        print("%0.3f (+/-%0.03f) for %r"
              % (mean_score, cv_scores.std() * 2, params))
    print("classification report:")
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))


We observe the following output: 


Best parameters on development set: 
{'penalty': 'l1', 'C': 0.1} 
Grid scores on development set: 
[eight lines of the form "mean (+/-spread) for {'penalty': ..., 'C': ...}", one for each combination 
of penalty ('l1', 'l2') and C (0.01, 0.1, 1, 5)] 
classification report: 
[table of precision, recall, f1-score, and support for each of the two classes and their average over 
the 400 test samples] 
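

In recent scikit-learn releases, GridSearchCV lives in sklearn.model_selection and the grid_scores_ 
attribute has been replaced by the cv_results_ dictionary. The following is a minimal sketch of the 
same search under that assumption. The liblinear solver is our choice here because it supports both 
the L1 and L2 penalties, and only the weighted f1 score is used to keep the example short: 

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_blobs(n_samples=800, centers=2, random_state=0, cluster_std=4)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

param_grid = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 5]}
grid = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid,
                    cv=5, scoring="f1_weighted")
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
results = grid.cv_results_
# One entry per parameter combination, averaged over the five folds
for mean, std, params in zip(results["mean_test_score"],
                             results["std_test_score"],
                             results["params"]):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
```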





Grid search is probably the most used method of hyperparameter optimization; however, there are 
times when it may not be the best choice. The RandomizedSearchCV object implements a randomized 
search over possible parameters. It uses a dictionary similar to the GridSearchCV object; however, 
for each parameter, a distribution can be set, over which a random search of values will be made. If 
the dictionary contains a list of values, then these will be sampled uniformly. Additionally, the 
RandomizedSearchCV object also contains an n_iter parameter that is effectively a computational 
budget of the number of parameter settings sampled. It defaults to 10, and higher values will 
generally give better results. However, this is at the expense of runtime. 
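

As a minimal sketch of a randomized search, assuming a recent scikit-learn release (where 
RandomizedSearchCV lives in sklearn.model_selection) and SciPy for the distribution, we can sample C 
from a continuous uniform distribution while the penalty is drawn from a list. The dataset, the range 
of C, and the budget of eight settings are hypothetical choices: 

```python
from scipy.stats import uniform
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = make_blobs(n_samples=400, centers=2, random_state=0, cluster_std=4)

param_distributions = {
    "C": uniform(loc=0.01, scale=5),  # continuous values in [0.01, 5.01]
    "penalty": ["l1", "l2"],          # a list of values is sampled uniformly
}
search = RandomizedSearchCV(LogisticRegression(solver="liblinear"),
                            param_distributions, n_iter=8, scoring="f1",
                            cv=5, random_state=0)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Here, n_iter=8 means that only eight parameter settings are evaluated, however many values the 
distributions could produce. 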


There are alternatives to the brute force approach of the grid search, and these are provided in 
estimators such as LassoCV and ElasticNetCV. Here, the estimator itself optimizes its regularization 
parameter by fitting it along a regularization path. This is usually more efficient than using a grid 
search. 


Learning curves 


An important way to understand how a model is performing is by using learning curves. Consider 
what happens to the training and test errors as we increase the number of samples. Consider a simple 
linear model. With few training samples, it is very easy for it to fit the parameters, so the training 
error will be small. As the training set grows, it becomes harder to fit, and the average training error 
will likely grow. On the other hand, the cross validation error will likely decrease, at least at the 
beginning, as samples are added. With more samples to train on, the model will be better able to 
generalize to new samples. Consider a model with high bias, for example, a simple linear classifier 
with two parameters. This is just a straight line, so as we start adding training examples, the cross 
validation error will initially decrease. However, after a certain point, adding training examples will 
not reduce the error significantly, simply because of the limitations of a straight line: it cannot 
fit nonlinear data. If we look at the training error, we see that, like earlier, it initially increases with 
more training samples, and at a certain point, it will approximately equal the cross validation error. 
Both the cross validation and training errors will be high in a high-bias example. What this shows is 
that if we know our learning algorithm has high bias, then just adding more training examples will be 
unlikely to improve the model significantly. 


Now, consider a model with high variance, say with a large number of polynomial terms, and a small 
value for the regularization parameter. As we add more samples, the training error will increase 
slowly but remain relatively small. As more training samples are added, the error on the cross 
validation set will decrease, but it stays well above the training error. This gap is an indication of 
over-fitting. The indicative characteristic of a model with high variance is a large difference 
between the training error and the test error. What this is showing is that increasing training 
examples will lower the cross validation error, and therefore, adding training samples is a likely 
way to improve a model with high variance. 
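

The high-bias behavior described above can be reproduced with a few lines of NumPy. In this sketch, 
a straight line is fitted to data generated from a quadratic function, and the training and test 
errors are recorded as the training set grows. The data generating function, the noise level, and the 
sample sizes are hypothetical choices made for the illustration: 

```python
import numpy as np

rng = np.random.RandomState(0)


def sample(n):
    """Draw n points from y = x**2 plus a little Gaussian noise."""
    x = rng.uniform(-1, 1, n)
    return x, x ** 2 + rng.normal(0, 0.01, n)


x_test, y_test = sample(1000)
sizes = [2, 10, 50, 500]
train_err, test_err = [], []
for n in sizes:
    x_train, y_train = sample(n)
    # A degree 1 polynomial is a straight line: a high bias model for this data
    slope, intercept = np.polyfit(x_train, y_train, deg=1)
    train_err.append(np.mean((slope * x_train + intercept - y_train) ** 2))
    test_err.append(np.mean((slope * x_test + intercept - y_test) ** 2))

for n, tr, te in zip(sizes, train_err, test_err):
    print("n=%4d  training error=%.4f  test error=%.4f" % (n, tr, te))
```

With two samples the line fits the training data perfectly, so the training error is essentially 
zero. At 500 samples the two errors are close to each other and both remain high, because no straight 
line can follow the curve, however much data it is given. 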


In the following code, we use the learning_curve function to plot the test error and the training 
error as we increase the sample size. This should give you an indication when a particular model is 
suffering from high bias or high variance. In this case, we are using a logistic regression model. We 
can see from the output of this code that the model may be suffering from bias, since both the training 
and test errors are relatively high: 


from sklearn.pipeline import Pipeline
from sklearn.learning_curve import learning_curve
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import cross_validation
from sklearn import datasets

X, y = datasets.make_classification(n_samples=2000, n_informative=2,
                                    n_redundant=0, random_state=42)
Xtrain, Xtest, ytrain, ytest = cross_validation.train_test_split(
    X, y, test_size=0.5, random_state=1)
pipe = Pipeline([('sc', StandardScaler()),
                 ('clf', LogisticRegression(penalty='l2'))])
trainSizes, trainScores, testScores = learning_curve(
    estimator=pipe, X=Xtrain, y=ytrain,
    train_sizes=np.linspace(0.1, 1, 10), cv=10, n_jobs=1)
# Convert accuracy scores to error rates, averaged over the 10 folds
trainMeanErr = 1 - np.mean(trainScores, axis=1)
testMeanErr = 1 - np.mean(testScores, axis=1)
plt.plot(trainSizes, trainMeanErr, color='red', marker='o', markersize=5,
         label='training error')
plt.plot(trainSizes, testMeanErr, color='green', marker='s', markersize=5,
         label='test error')
plt.grid()
plt.xlabel('Number of Training Samples')
plt.ylabel('Error')
plt.legend(loc=0)
plt.show()


Here is the output of the preceding code: 


[Figure: training error and test error plotted against the number of training samples] 


Real-world case studies 


Now, we will move on to some real-world machine learning scenarios. First, we will build a 
recommender system, and then we will look into some integrated pest management systems in 
greenhouses. 


Building a recommender system 


Recommender systems are a type of information filtering, and there are two general approaches: 
content-based filtering and collaborative filtering. In content-based filtering, the system attempts to 
model a user's long term interests and select items based on this. On the other hand, collaborative 
filtering chooses items based on the correlation with items chosen by people with similar 
preferences. As you would expect, many systems use a hybrid of these two approaches. 


Content-based filtering 


Content-based filtering uses the content of items, which is represented as a set of descriptor terms, 
and matches them with a user profile. A user profile is constructed using the same terms extracted 
from items that the user has previously viewed. A typical online book store will extract key terms 
from texts to create a user profile and to make recommendations. This procedure of extracting these 
terms can be automated in many cases, although in situations where specific domain knowledge is 
required, these terms may need to be added manually. The manual addition of terms is particularly 
relevant when dealing with non-text based items. It is relatively easy to extract key terms from, say, a 
library of books. For other kinds of items, this will in many cases involve a human creating these 
associations based on specific domain knowledge, say by associating Fender amplifiers with electric 
guitars. Once this is constructed, we need to choose a learning algorithm that can learn a user 
profile and make appropriate recommendations. The two models that are most often used are the vector 
space model and the latent semantic indexing model. With the vector space model, we create a sparse 
vector representing a document where each distinct term in a document corresponds to a dimension of 
the vector. Weights are used to indicate whether a term appears in a document. When it does appear, 
it has a weight of 1, and when it does not, it has a weight of 0. Weights based on the number of 
times a word appears are also used. 
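

A minimal sketch of the vector space model follows. Each item and the user profile are turned into 
binary term vectors over a shared vocabulary, and items are ranked by how closely their vectors match 
the profile. Cosine similarity is used as the matching score here; this is a common choice but an 
assumption on our part, as are the vocabulary, the item titles, and the helper names: 

```python
import math


def binary_vector(terms, vocabulary):
    """Weight 1 if the term appears in the document, 0 otherwise."""
    present = set(terms)
    return [1 if term in present else 0 for term in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two term vectors (0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vocabulary = ["amplifier", "guitar", "python", "regression", "strings"]
# Terms extracted from items the (hypothetical) user has previously viewed
profile = binary_vector(["guitar", "amplifier"], vocabulary)

items = {
    "Electric guitar handbook": ["guitar", "strings", "amplifier"],
    "Regression in Python": ["python", "regression"],
}
ranked = sorted(
    items,
    key=lambda name: cosine(profile, binary_vector(items[name], vocabulary)),
    reverse=True)
print(ranked)
```

Replacing the 0 and 1 weights with term counts only requires changing binary_vector; the ranking 
step stays the same. 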


The alternative model, latent semantic indexing, can improve the vector model in several ways. 
Consider the fact that the same concept is often described by many different words, that is, with 
synonyms. For example, we need to know that a computer monitor and computer screen are, for most 
purposes, the same thing. Also, consider that many words have more than one distinct meaning, for 
example, the word mouse can either be an animal or a computer interface. Semantic indexing 
incorporates this information by building a term-document matrix. Each entry represents the number 
of occurrences of a particular term in the document. There is one row for each of the terms in a set of 
documents, and there is one column for every document. Through a mathematical process known as 
singular value decomposition, this single matrix can be decomposed into three matrices representing 
documents and terms as vectors of factor values. Essentially, this is a dimension reduction technique 
whereby we create single features that represent multiple words. A recommendation is made based 
on these derived features. This recommendation is based on semantic relationships within the 
document rather than simply matching on identical words. The disadvantage of this technique is that 
it is computationally expensive and may be slow to run. This can be a significant constraint for a 
recommender system that has to work in real time. 
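

The decomposition step can be sketched with NumPy. The tiny term-document matrix below is a 
hypothetical example built around the monitor, screen, and mouse terms mentioned above, and keeping 
k=2 factors is an arbitrary choice for the illustration: 

```python
import numpy as np

terms = ["monitor", "screen", "computer", "mouse", "cheese"]
# Rows are terms, columns are documents; entries are occurrence counts
A = np.array([
    [1, 0, 0],   # monitor
    [0, 1, 0],   # screen
    [1, 1, 0],   # computer
    [0, 1, 1],   # mouse
    [0, 0, 1],   # cheese
], dtype=float)

# Singular value decomposition: A = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of derived features (factors) to keep
A_k = (U[:, :k] * s[:k]).dot(Vt[:k, :])      # rank-k approximation of A
doc_vectors = (s[:k, None] * Vt[:k, :]).T    # one k-dimensional vector per document

print(np.round(doc_vectors, 3))
```

Documents are then compared in the reduced space of doc_vectors rather than on the raw term counts, 
so two documents can be close even when they share a concept but not the exact same words. 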


Collaborative filtering 


Collaborative filtering takes a different approach and is used in a variety of settings, particularly in 
the context of social media, and there are a variety of ways to implement it. Most take a 
neighborhood approach. This is based on the idea that you are more likely to trust your friends' 
recommendations, or those of people with similar interests, than those of people you have less in 
common with. 


In this approach, a weighted average of the recommendations of other people is used. The weights are 
determined by the correlation between individuals. That is, those with similar preferences will be 
weighted higher than those that are less similar. In a large system with many thousands of users, it 
becomes infeasible to calculate all the weights at runtime. Instead, the recommendations of a 
neighborhood are used. This neighborhood is selected either by using a certain weight threshold, or 
by selecting based on the highest correlation. 


In the following code, we use a dictionary of users and their ratings of music albums. The geometric 
nature of this model is most apparent when we plot users' ratings of two albums. It is easy to see that 
the distance between users on the plot is a good indication of how similar their ratings are. The 
Euclidean distance measures how far apart users are, in terms of how closely their preferences match. 
We also need a way to take into account associations between two people, and for this we use the 
Pearson correlation coefficient. Once we can compute the similarity between users, we rank them in 
order of similarity. From here, we can work out what albums could be recommended. This is done by 
multiplying the similarity score of each user by their ratings. This is then summed and divided by the 
sum of the similarity scores, essentially calculating a weighted average based on the similarity score. 
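

The weighted average described here can be worked through on a very small example. The sketch 
below computes the Pearson correlation over the albums two users have both rated, and then predicts a 
rating as the similarity-weighted average of other users' ratings. The function names pearson and 
predict_rating, and the three users and their ratings, are hypothetical: 

```python
import math


def pearson(prefs, a, b):
    """Pearson correlation over items rated by both users (0 if undefined)."""
    shared = [item for item in prefs[a] if item in prefs[b]]
    n = len(shared)
    if n < 2:
        return 0.0
    xs = [prefs[a][item] for item in shared]
    ys = [prefs[b][item] for item in shared]
    mx, my = sum(xs) / float(n), sum(ys) / float(n)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                       sum((y - my) ** 2 for y in ys))
    return cov / spread if spread else 0.0


def predict_rating(prefs, user, item, similarity=pearson):
    """Similarity-weighted average of other users' ratings of item."""
    weighted_sum = similarity_sum = 0.0
    for other in prefs:
        if other == user or item not in prefs[other]:
            continue
        sim = similarity(prefs, user, other)
        if sim <= 0:
            continue  # ignore users with no or negative correlation
        weighted_sum += sim * prefs[other][item]
        similarity_sum += sim
    return weighted_sum / similarity_sum if similarity_sum else None


ratings = {
    "Ann": {"A": 5.0, "B": 3.0, "C": 4.0},
    "Bob": {"A": 4.0, "B": 2.0, "C": 3.0, "D": 5.0},
    "Cy": {"A": 1.0, "B": 5.0, "C": 2.0, "D": 1.0},
}
print(pearson(ratings, "Ann", "Bob"), pearson(ratings, "Ann", "Cy"))
print(predict_rating(ratings, "Ann", "D"))
```

Bob's ratings rise and fall with Ann's, so his opinion of album D carries all the weight, while 
Cy, whose tastes run opposite to Ann's, is ignored. 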


Another approach is to find the similarities between items. This is called item-based collaborative 
filtering; this is in contrast with user-based collaborative filtering, which we used to calculate the 
similarity score. The item-based approach is to find similar items for each item. Once we have the 
similarities between all the albums, we can generate recommendations for a particular user. 


Let's take a look at a sample code implementation: 


import pandas as pd
from scipy.stats import pearsonr
import matplotlib.pyplot as plt

# In Sam's ratings the key 'Legend' appears twice; the later value (9.0) wins
userRatings = {
    'Dave': {'Dark Side of Moon': 9.0, 'Hard Road': 6.5, 'Symphony 5': 8.0,
             'Blood Cells': 4.0},
    'Jen': {'Hard Road': 7.0, 'Symphony 5': 4.5, 'Abbey Road': 8.5,
            'Ziggy Stardust': 9, 'Best Of Miles': 7},
    'Roy': {'Dark Side of Moon': 7.0, 'Hard Road': 3.5, 'Blood Cells': 4,
            'Vitalogy': 6.0, 'Ziggy Stardust': 8, 'Legend': 7.0, 'Abbey Road': 4},
    'Rob': {'Mass in B minor': 10, 'Symphony 5': 9.5, 'Blood Cells': 3.5,
            'Ziggy Stardust': 8, 'Black Star': 9.5, 'Abbey Road': 7.5},
    'Sam': {'Hard Road': 8.5, 'Vitalogy': 5.0, 'Legend': 8.0, 'Ziggy Stardust': 9.5,
            'U2 Live': 7.5, 'Legend': 9.0, 'Abbey Road': 2},
    'Tom': {'Symphony 5': 4, 'U2 Live': 7.5, 'Vitalogy': 7.0, 'Abbey Road': 4.5},
    'Kate': {'Horses': 8.0, 'Symphony 5': 6.5, 'Ziggy Stardust': 8.5, 'Hard Road': 6.0,
             'Legend': 8.0, 'Blood Cells': 9, 'Abbey Road': 6}}


# Returns a distance-based similarity score for user1 and user2
def distance(prefs, user1, user2):
    # Collect the items rated by both users
    si = {}
    for item in prefs[user1]:
        if item in prefs[user2]:
            si[item] = 1
    # if they have no ratings in common, return 0
    if len(si) == 0:
        return 0
    # Add up the squares of all the differences
    sum_of_squares = sum([pow(prefs[user1][item] - prefs[user2][item], 2)
                          for item in prefs[user1] if item in prefs[user2]])
    # 1.0 forces float division even when all ratings are integers
    return 1.0 / (1 + sum_of_squares)


# Returns the n best matches for person, most similar first.
# scipy's pearsonr(x, y) takes two aligned rating vectors rather than
# (prefs, a, b), so it has to be wrapped before it can be passed as
# similarity; distance is used as the default here.
def Matches(prefs, person, n=5, similarity=distance):
    scores = [(similarity(prefs, person, other), other)
              for other in prefs if other != person]
    scores.sort()
    scores.reverse()
    return scores[0:n]


def getRecommendations (prefs,person,similarity=pearsonr) : 


def 


totals={ } 
simSums={ } 
for other in prefs: 
1£ other==person: continue 
Sim=similarity (prefs,person,other) 
1f sim<=0: continue 
for item in prefs[other]: 
# only score albums not yet rated 
1£f item not in prefs[person] or prefs[person] [item]== 
# Similarity * Score 
totals .setdefault (item, 0) 
totals [item]+=prefs [other] [item] *sim 
# Sum of similarities 
simSums.setdefault (item, 0) 
simSums [item] +=sim 
# Create a normalized list 
rankings=[(total/simSums[item],item) for item,total in totals.items( ) ] 
# Return a sorted list 
rankings.sort( ) 
rankings.reverse( ) 
return rankings 


def transformPrefs(prefs):
    result = {}
    for person in prefs:
        for item in prefs[person]:
            result.setdefault(item, {})
            # Flip item and person
            result[item][person] = prefs[person][item]
    return result


transformPrefs(userRatings)


def calculateSimilarItems(prefs, n=10):
    # Create a dictionary of similar items
    result = {}
    # Invert the preference matrix to be item-centric
    itemPrefs = transformPrefs(prefs)
    c = 0
    for item in itemPrefs:
        # Print a status update every 100 items
        c += 1
        if c % 100 == 0:
            print("%d / %d" % (c, len(itemPrefs)))
        scores = Matches(itemPrefs, item, n=n, similarity=distance)
        result[item] = scores
    return result


def getRecommendedItems(prefs, itemMatch, user):
    userRatings = prefs[user]
    scores = {}
    totalSim = {}

    # Loop over items rated by this user
    for (item, rating) in userRatings.items():

        # Loop over items similar to this one
        for (similarity, item2) in itemMatch[item]:

            # Ignore if this user has already rated this item
            if item2 in userRatings:
                continue

            # Weighted sum of rating times similarity
            scores.setdefault(item2, 0)
            scores[item2] += similarity * rating

            # Sum of all the similarities
            totalSim.setdefault(item2, 0)
            totalSim[item2] += similarity

    # Divide each total score by total weighting to get an average
    rankings = [(score / totalSim[item], item) for item, score in scores.items()]

    # Return the rankings from highest to lowest
    rankings.sort()
    rankings.reverse()
    return rankings


itemsim = calculateSimilarItems(userRatings)


def plotDistance(album1, album2):
    data = []
    for i in userRatings.keys():
        try:
            data.append((i, userRatings[i][album1], userRatings[i][album2]))
        except KeyError:
            # Skip users who have not rated both albums
            pass
    df = pd.DataFrame(data=data, columns=['user', album1, album2])
    plt.scatter(df[album1], df[album2])
    plt.xlabel(album1)
    plt.ylabel(album2)
    for i, t in enumerate(df.user):
        plt.annotate(t, (df[album1][i], df[album2][i]))
    plt.show()
    print(df)


plotDistance('Abbey Road', 'Ziggy Stardust')
print(getRecommendedItems(userRatings, itemsim, 'Dave'))


You will observe the following output:


[Figure: scatter plot of user ratings with Abbey Road on the x-axis and Ziggy Stardust on the y-axis,
each point annotated with the user's name (Sam, Kate, Roy, Jen, Rob), followed by the printed table of
users and their ratings for the two albums]


My recomendations:
[(8.622815087396503, 'Ziggy Stardust'), (8.378259676091135, 'Legend'),
(7.894736842105265, 'Black Star'), (7.887649449438202, 'Mass in B minor'),
(7.581141117864468, 'Abbey Road'), (7.55722891566265, 'Vitalogy'),
(6.69672131147541, 'U2 Live'), (6.681818181818182, 'Best Of Miles'),
(5.7175572519083975, 'Horses')]





Here we have plotted the user ratings of two albums, and based on this, we can see that the users
Kate and Rob are relatively close, that is, their preferences with regard to these two albums are
similar. On the other hand, the users Rob and Sam are far apart, indicating different preferences for
these two albums. We also print out recommendations for the user Dave, together with the predicted
rating (a similarity-weighted average of his existing ratings) for each album recommended.
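
To make the notion of "close" and "far apart" concrete, the following minimal sketch computes a
distance-based similarity for two pairs of users over two albums. The ratings are illustrative
values chosen for this example only, not the contents of the userRatings dictionary, and the
function name distance_similarity is an assumption.

```python
# Illustrative ratings only; these values are assumptions made for this sketch.
ratings = {
    'Kate': {'Abbey Road': 6.0, 'Ziggy Stardust': 8.5},
    'Rob': {'Abbey Road': 6.5, 'Ziggy Stardust': 8.0},
    'Sam': {'Abbey Road': 2.0, 'Ziggy Stardust': 4.0},
}


def distance_similarity(prefs, user1, user2):
    """Return 1 / (1 + sum of squared rating differences) over shared items."""
    shared = [item for item in prefs[user1] if item in prefs[user2]]
    if not shared:
        return 0
    sum_of_squares = sum((prefs[user1][item] - prefs[user2][item]) ** 2
                         for item in shared)
    return 1 / (1 + sum_of_squares)


close = distance_similarity(ratings, 'Kate', 'Rob')  # small distance, high score
far = distance_similarity(ratings, 'Rob', 'Sam')     # large distance, low score
print(close, far)
```

Users whose points lie near each other on the scatter plot get a score near 1, while distant users
get a score near 0.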


Since collaborative filtering is reliant on the ratings of other users, a problem arises when the number
of items becomes much larger than the number of ratings, so the number of items that a user has
rated is a tiny proportion of all the items. There are a few different approaches to help you fix this.
Ratings can be inferred from the type of items users browse for on the site. Another way is to
supplement the ratings of users with content-based filtering in a hybrid approach.
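
As a rough illustration of how sparse the ratings can become, the sketch below measures the
fraction of the user-by-item matrix that actually holds a rating. The catalogue size and the
ratings are hypothetical, and rating_density is a name made up for this example.

```python
# Hypothetical toy data: each user has rated only a few of the catalogue's items.
catalogue_size = 1000  # assumed number of items on the site
user_ratings = {
    'Dave': {'Abbey Road': 7.0, 'Legend': 8.0},
    'Jen': {'Horses': 6.5},
    'Roy': {'Abbey Road': 5.0, 'Horses': 9.0, 'Vitalogy': 4.5},
}


def rating_density(prefs, n_items):
    """Fraction of the user-by-item matrix that actually holds a rating."""
    n_ratings = sum(len(items) for items in prefs.values())
    return n_ratings / (len(prefs) * n_items)


density = rating_density(user_ratings, catalogue_size)
print("density: %.4f" % density)
```

A very low density is a signal that inferred ratings or a content-based component may be worth
adding.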
Reviewing the case study


Some important aspects of this case study are as follows: 


• It is part of a web application. It must run in real time, and it relies on user interactivity.

• There are extensive practical and theoretical resources available. This is a well-studied
problem and has several well-defined solutions. We do not have to reinvent the wheel.

• This is largely a marketing project. It has a quantifiable metric of success, namely the sales
volume generated by recommendations.

• The cost of failure is relatively low. A small level of error is acceptable.
Insect detection in greenhouses


A growing population and increasing climate variability pose unique challenges for agriculture in the
21st century. The ability of controlled environments, such as greenhouses, to provide optimum
growing conditions and maximize the efficient use of inputs, such as water and nutrients, will enable
us to continue to feed growing populations in a changing global climate.


There are many food production systems that today are largely automated, and these can be quite 
sophisticated. Aquaculture systems can cycle nutrients and water between fish tanks and growing 
racks, in essence, creating a very simple ecology in an artificial environment. The nutrient content of 
the water is regulated, as are the temperature, moisture levels, humidity, and carbon dioxide levels. 
These features exist within very precise ranges to optimize for production. 


The environmental conditions inside greenhouses can be very conducive to the rapid spread of
disease and pests. Early detection and the detection of precursor symptoms, such as fungi or insect
egg production, are essential to managing these diseases and pests. For environmental, food quality,
and economic reasons, we want to apply only minimal, targeted controls, since a control mostly
involves the application of a pesticide or some other bio agent.


The goal here is to create an automated system that will detect the type and location of a disease or
insect and subsequently choose, and ideally implement, a control. This is quite a large undertaking
with a number of different components. Many of the technologies exist separately, but here we are
combining them in a number of non-standard ways. The approach is largely experimental.


The usual method of detection has been direct human observation. This is a very time-intensive task
and requires some particular skills. It is also very error prone. Automating this would be of huge
benefit in itself, as well as being an important starting point for creating an automated integrated
pest management (IPM) system. One of the first tasks is to define a set of indicators for each of the
targets. A natural approach would be to get an expert, or a panel of experts, to classify short video
clips as either being pest free or infected with one or more target species. Next, a classifier is
trained on these clips, and hopefully, it is able to make accurate predictions. This approach has been
used in the past for the detection of insect pests, for example, in Early Pest Detection in
Greenhouses (Martin, Moisan, 2004).


In a typical setup, video cameras are placed throughout the greenhouse to maximize the sampling area. 
For the early detection of pests, key plant organs such as the stems, leaf nodes, and other areas are 
targeted. Since video and image analysis can be computationally expensive, motion sensitive cameras 
that are intelligently programmed to begin recording when they detect insect movement can be used. 


The changes in early outbreaks are quite subtle and can be indicated by a combination of plant
damage, discolorations, reduced growth, and the presence of insects or their eggs. This difficulty is
compounded by the variable light conditions in greenhouses. A way of coping with these issues is to
use a cognitive vision approach. This divides the problem into a number of sub-problems, each of
which is context dependent. For example, we can use a different model for when it is sunny, or
choose a model based on the light conditions at different times of the day. The knowledge of this
context can be built into the model at a preliminary, weak learning stage. This gives it an inbuilt
heuristic to apply an appropriate learning algorithm in a given context.
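
The idea of choosing a model by context can be sketched as a simple dispatch table. Everything in
the sketch below is hypothetical: the light threshold, the hours that count as night, and the
stand-in detectors, which would be separately trained models in a real system.

```python
# Hypothetical sketch: the thresholds and detector names are placeholders.
def light_context(lux, hour):
    """Map a raw light level and the time of day to a coarse context label."""
    if hour < 6 or hour >= 20:
        return 'night'
    return 'sunny' if lux >= 30000 else 'overcast'


# One detector per context; each lambda stands in for a separately trained model.
detectors = {
    'sunny': lambda frame: 'sunny-model prediction',
    'overcast': lambda frame: 'overcast-model prediction',
    'night': lambda frame: 'night-model prediction',
}


def detect(frame, lux, hour):
    """Dispatch the frame to the detector trained for the current context."""
    return detectors[light_context(lux, hour)](frame)


print(detect(frame=None, lux=45000, hour=13))
```

Keeping the context decision separate from the detectors means each model only has to cope with
the lighting conditions it was trained on.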


An important requirement is that we distinguish between different insect species, and a way to do this
is by capturing the dynamic components of insects, that is, their behavior. Many insects can be
distinguished by their type of movement, for example, flying in tight circles, or stationary most of the
time with short bursts of flight. Also, insects may have other behaviors, such as mating or laying eggs,
that might be an important indicator of a control being required.


Monitoring can occur over a number of channels, most notably video and still photography, as well as 
using signals from other sensors such as infrared, temperature, and humidity sensors. All these inputs 
need to be time and location stamped so that they can be used meaningfully in a machine learning 
model. 
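
One way to picture time and location stamping is a small record type that every channel shares.
The following is a minimal sketch; the class name SensorReading, its fields, and the align helper
are assumptions made for illustration, not part of any particular greenhouse system.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SensorReading:
    """One observation; all field names are assumptions for illustration."""
    timestamp: datetime  # when the reading was taken (UTC)
    location: tuple      # (row, bay) position in the greenhouse
    channel: str         # for example 'temperature', 'humidity', 'infrared'
    value: float


def align(readings, location, start, end):
    """Select readings from one location inside a half-open time window."""
    return [r for r in readings
            if r.location == location and start <= r.timestamp < end]


def at(minute):
    """Helper that builds a UTC timestamp on an arbitrary example day."""
    return datetime(2024, 1, 1, 12, minute, tzinfo=timezone.utc)


readings = [
    SensorReading(at(0), (3, 1), 'temperature', 24.5),
    SensorReading(at(1), (3, 1), 'humidity', 0.81),
    SensorReading(at(2), (7, 4), 'temperature', 22.0),
]
window = align(readings, (3, 1), at(0), at(5))
print([r.channel for r in window])
```

With a shared stamp, readings from different channels can be joined into a single feature row per
location and time window.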


Video processing first involves subtracting the background and isolating the moving components of
the sequence. At the pixel level, the lighting condition results in a variation of intensity, saturation,
and inter-pixel contrast. At the image level, conditions such as shadows affect only a portion of the
image, whereas backlighting affects the entire image.
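
Background subtraction can be sketched in a few lines of NumPy. This is a deliberately simple
variant that assumes a static camera, greyscale frames, and a fixed threshold; it estimates the
background as the per-pixel median of earlier frames and is not meant to handle the lighting
variation described above. The function name moving_mask and the threshold value are assumptions.

```python
import numpy as np


def moving_mask(frames, threshold=25):
    """Flag pixels in the last frame that differ from the median background."""
    stack = np.asarray(frames, dtype=np.float64)
    background = np.median(stack[:-1], axis=0)  # static scene estimate
    return np.abs(stack[-1] - background) > threshold


# Synthetic 4x4 greyscale frames: a constant background, then one bright pixel.
frames = [np.full((4, 4), 100) for _ in range(6)]
frames[-1][2, 1] = 200  # the "insect" appears in the final frame
mask = moving_mask(frames)
print(np.argwhere(mask))
```

The resulting boolean mask marks the moving component, which can then be cropped and passed on
for classification.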


In this example, we extract frames from the video recordings and process them in their own separate
path in the system. As opposed to video processing, where we were interested in the sequence of
frames over time in an effort to detect movement, here we are interested in single frames from several
cameras, focused on the same location at the same time. This way, we can build up a
three-dimensional model, and this can be useful, especially for tracking changes to biomass volume.


The final inputs for our machine learning model are environmental sensors. Standard control systems
measure temperature, relative humidity, carbon dioxide levels, and light. In addition, hyper-spectral
and multi-spectral sensors are capable of detecting frequencies outside the visible spectrum. The
nature of these signals requires their own distinctive processing paths. As an example of how they
might be used, consider that one of our targets is a fungus that we know exists in a narrow range of
humidity and temperature. Suppose an ultraviolet sensor in a part of the greenhouse briefly detects
the frequency range indicative of the fungus. Our model would register this, and if the humidity and
temperature are in this range, then a control may be initiated. This control may be simply the opening
of a vent or the switching on of a fan near the possible outbreak to locally cool the region to a
temperature at which the fungus cannot survive.
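
The fungus example amounts to a rule that combines a spectral detection with two environmental
ranges. A minimal sketch follows; the humidity and temperature ranges are invented placeholders
rather than agronomic values, and the function and constant names are assumptions.

```python
# All ranges below are invented placeholders, not agronomic values.
FUNGUS_HUMIDITY = (0.85, 0.95)     # relative humidity range (fraction)
FUNGUS_TEMPERATURE = (18.0, 24.0)  # degrees Celsius


def in_range(value, bounds):
    """Return True if value lies within the inclusive (low, high) bounds."""
    low, high = bounds
    return low <= value <= high


def choose_control(uv_signature_detected, humidity, temperature):
    """Return a control action, or None if no action is warranted."""
    if (uv_signature_detected
            and in_range(humidity, FUNGUS_HUMIDITY)
            and in_range(temperature, FUNGUS_TEMPERATURE)):
        return 'switch on fan'
    return None


print(choose_control(True, 0.90, 21.0))
```

In a learned system these hand-written ranges would be replaced by a model, but the inputs and the
output (a control or no control) keep the same shape.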


Clearly, the most complex part of the system is the action controller. This really comprises two
elements: a multi-label classifier, which outputs a binary vector representing the presence or absence
of each target pest, and the action classifier itself, which outputs a control strategy.


There are many different components and a number of distinct systems that are needed to detect the 
various pathogens and pests. The standard approach has been to create a separate learning model for 
each target. This multi-model approach works if we are instigating controls for each of these as 
separate, unrelated activities. However, many of the processes, such as the development and spread 
of disease and a sudden outbreak of insects, may be precipitated by a common cause. 


Reviewing the case study 


Some important aspects of this case study are as follows: 


• It is largely a research project. It has a long timeline involving a large space of unknowns.

• It comprises a number of interrelated systems. Each one can be worked on separately, but at
some point needs to be integrated back into the entire system.

• It requires significant domain knowledge.

Machine learning at a glance


The physical design process (involving humans, decisions, constraints, and the most potent of all:
unpredictability) has parallels with the machine learning systems we are building. The decision
boundary of a classifier, data constraints, and the uses of randomness to initialize or introduce
diversity in models are just three connections we can make. The deeper question is how far we can
take this analogy. If we are trying to build artificial intelligence, the question is, "Are we trying to
replicate the process of human intelligence, or simply imitate its consequences, that is, make a
reasonable decision?" This of course is ripe for vigorous philosophical discussion and, though
interesting, is largely irrelevant to the present discussion. The important point, however, is that much
can be learned from observing natural systems, such as the brain, and attempting to mimic their
actions.


Real human decision making occurs in a wider context of complex brain action, and in the setting of a
design process, the decisions we make are often group decisions. The analogy to an artificial neural
net ensemble is irresistible. As with an ensemble of learning candidates made up of mostly weak
learners, the decisions made over the lifespan of a project will end up with a result far greater than
any individual contribution. Importantly, an incorrect decision, analogous say to a poor split in a
decision tree, is not wasted time, since part of the role of weak learners is to rule out incorrect
possibilities. In a complex machine learning project, it can be frustrating to realize that much of the
work done does not directly lead to a successful result. The initial focus should be on providing
convincing arguments that a positive result is possible.


The analogy between machine learning systems and the design process itself is, of course,
oversimplistic. There are many things in team dynamics that are not represented by a machine learning
ensemble. For example, human decision making occurs in the rather elusive context of emotion,
intuition, and a lifetime of experience. Also, team dynamics are often shaped by personal ambition,
subtle prejudices, and by relationships between team members. Importantly, managing a team must be
integrated into the design process.


A machine learning project of any scale will require collaboration. The space is simply too large for 
any one person to be fully cognizant of all the different interrelated elements. Even the simple 
demonstration tasks outlined in this book would not be possible if it were not for the effort of many 
people developing the theory, writing the base algorithms, and collecting and organizing data. 


Successfully orchestrating a major project within time and resource constraints requires significant
skill, and these are not necessarily the skills of a software engineer or a data scientist. Obviously, we
must define what success, in any given context, means. A theoretical research project either
disproving or proving a particular theory with a degree of certainty, or a small degree of uncertainty,
is considered a success. Understanding the constraints may give us realistic expectations, in other
words, an achievable metric of success.


One of the most common and persistent constraints is that of insufficient, or inaccurate, data. The
data collection methodology is such an important aspect, yet in many projects it is overlooked. The
data collection process is interactive. It is impossible to interrogate any dynamic system without
changing that system. Also, some components of a system are simply easier to observe than others, and
therefore, may become inaccurate representations of wider unobserved, or unobservable,
components. In many cases, what we know about a complex system is dwarfed by what we do not
know. This uncertainty is embedded in the stochastic nature of physical reality, and it is the reason
that we must resort to probabilities in any predictive task. Deciding what level of probability is
acceptable for a given action, say to treat a potential patient based on the estimated probability of a
disease, depends on the consequences of treating the disease or not, and this usually relies on humans,
either the doctor or the patient, to make the final decision. There are many issues outside the domain
that may influence such a decision.
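
The link between an acceptable probability and the consequences of acting can be illustrated with
a simple expected-cost rule: act when the expected cost of a missed case exceeds the expected cost
of an unnecessary action. The sketch below is only one possible formalization, with made-up costs,
and it does not replace the human judgment described above.

```python
def treatment_threshold(cost_false_positive, cost_false_negative):
    """Probability above which acting has the lower expected cost."""
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def should_treat(p_disease, cost_false_positive, cost_false_negative):
    """Compare the estimated probability with the cost-derived threshold."""
    return p_disease > treatment_threshold(cost_false_positive,
                                           cost_false_negative)


# Assumed costs: missing the disease is nine times worse than needless treatment.
threshold = treatment_threshold(1.0, 9.0)
print(threshold, should_treat(0.25, 1.0, 9.0))
```

The more costly a missed case is relative to a needless treatment, the lower the probability at
which acting becomes reasonable.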


Human problem solving, although it shares many similarities with machine problem solving, is
fundamentally different from it. It is dependent on so many things, not least of which is the emotional
and physical state, that is, the chemical and electrical bath a nervous system is enveloped in. Human
thought is not a deterministic process, and this is actually a good thing because it enables us to solve
problems in novel ways. Creative problem solving involves the ability to link disparate ideas or
concepts. Often, the inspiration for this comes from an entirely irrelevant event, the proverbial
Newton's apple. The ability of the human brain to knit these often random events of everyday
experience into some sort of coherent, meaningful structure is the elusive ability we aspire to build
into our machines.

Summary


There is no doubt that the hardest thing to do in machine learning is to apply it to unique, previously 
unsolved problems. We have experimented with numerous example models and used some of the most 
popular algorithms for machine learning. The challenge is now to apply this knowledge to important 
new problems that you care about. I hope this book has taken you some way as an introduction to the 
possibilities of machine learning with Python.

Part 3. Module 3


Advanced Machine Learning with Python 


Leverage benefits of machine learning techniques using Python 




Chapter 1. Unsupervised Machine Learning 


In this chapter, you will learn how to apply unsupervised learning techniques to identify patterns and 
structure within datasets. 


Unsupervised learning techniques are a valuable set of tools for exploratory analysis. They bring out
patterns and structure within datasets, which yield information that may be informative in itself or
serve as a guide to further analysis. It's critical to have a solid set of unsupervised learning tools that
you can apply to help break up unfamiliar or complex datasets into actionable information.


We'll begin by reviewing Principal Component Analysis (PCA), a fundamental data manipulation
technique with a range of dimensionality reduction applications. Next, we will discuss k-means
clustering, a widely used and approachable unsupervised learning technique. Then, we will discuss
Kohonen's Self-Organizing Map (SOM), a method of topological clustering that enables the
projection of complex datasets into two dimensions.


Throughout the chapter, we will spend some time discussing how to effectively apply these 
techniques to make high-dimensional datasets readily accessible. We will use the UCI Handwritten 
Digits dataset to demonstrate technical applications of each algorithm. In the course of discussing and 
applying each technique, we will review practical applications and methodological questions, 
particularly regarding how to calibrate and validate each technique as well as which performance 
measures are valid. To recap, then, we will be covering the following topics in order: 


• Principal component analysis
• k-means clustering
• Self-organizing maps

Principal component analysis


In order to work effectively with high-dimensional datasets, it is important to have a set of techniques 
that can reduce this dimensionality down to manageable levels. The advantages of this dimensionality 
reduction include the ability to plot multivariate data in two dimensions, capture the majority of a 
dataset's informational content within a minimal number of features, and, in some contexts, identify 
collinear model components. 


Note 


For those in need of a refresher, collinearity in a machine learning context refers to model features
that share an approximately linear relationship. These features tend to be unhelpful because, taken
together, they are unlikely to add information beyond what either one provides independently.
Moreover, collinear features may emphasize local minima or other false leads.
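
A quick way to spot collinear features is to inspect their correlation matrix. The following
sketch uses synthetic data in which one feature is almost an exact multiple of another; the
variable names and noise levels are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 3.0 * x1 + rng.normal(scale=0.05, size=200)  # nearly a multiple of x1
x3 = rng.normal(size=200)                         # unrelated feature

features = np.column_stack([x1, x2, x3])
corr = np.corrcoef(features, rowvar=False)        # 3 x 3 correlation matrix
print(np.round(corr, 2))
```

An off-diagonal entry with an absolute value close to 1, as between the first two features here,
indicates that one of the pair adds little information.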


Probably the most widely used dimensionality reduction technique today is PCA. As we'll be
applying PCA in multiple contexts throughout this book, it's appropriate for us to review the
technique, understand the theory behind it, and write Python code to effectively apply it.

PCA: a primer


PCA is a powerful decomposition technique; it allows one to break down a highly multivariate
dataset into a set of orthogonal components. When taken together in sufficient number, these
components can explain almost all of the dataset's variance. In essence, these components deliver an
abbreviated description of the dataset. PCA has a broad set of applications and its extensive utility
makes it well worth our time to cover.


Note 


Note the slightly cautious phrasing here: a given set of components of length less than the number of
variables in the original dataset will almost always lose some amount of the information content
within the source dataset. This lossiness is typically minimal, given enough components, but in cases
where small numbers of principal components are composed from very high-dimensional datasets,
there may be substantial lossiness. As such, when performing PCA, it is always appropriate to
consider how many components will be necessary to effectively model the dataset in question.


PCA works by successively identifying the axes of greatest variance in a dataset (the principal
components). It does this as follows:


1. Identifying the center point of the dataset.
2. Calculating the covariance matrix of the data.
3. Calculating the eigenvectors of the covariance matrix.
4. Orthonormalizing the eigenvectors.
5. Calculating the proportion of variance represented by each eigenvector.


Let's unpack these concepts briefly: 


• Covariance is effectively variance applied to multiple dimensions; it is the variance between
two or more variables. While a single value can capture the variance in one dimension or
variable, it is necessary to use a 2 x 2 matrix to capture the covariance between two variables, a
3 x 3 matrix to capture the covariance between three variables, and so on. So the first step in
PCA is to calculate this covariance matrix.

• An eigenvector is a vector that is specific to a dataset and linear transformation. Specifically, it
is the vector that does not change in direction before and after the transformation is performed.
To get a better feeling for how this works, imagine that you're holding a rubber band, straight,
between both hands. Let's say you stretch the band out until it is taut between your hands. The
eigenvector is the vector that did not change direction between before the stretch and during it; in
this case, it's the vector running directly through the center of the band from one hand to the other.

• Orthogonalization is the process of finding two vectors that are orthogonal (at right angles) to
one another. In an n-dimensional data space, the process of orthogonalization takes a set of
vectors and yields a set of orthogonal vectors.

• Orthonormalization is an orthogonalization process that also normalizes the product.

• Eigenvalue (roughly corresponding to the length of the eigenvector) is used to calculate the
proportion of variance represented by each eigenvector. This is done by dividing the eigenvalue
for each eigenvector by the sum of eigenvalues for all eigenvectors.


In summary, the covariance matrix is used to calculate Eigenvectors. An orthonormalization process 
is undertaken that produces orthogonal, normalized vectors from the Eigenvectors. The eigenvector 
with the greatest eigenvalue is the first principal component with successive components having 
smaller eigenvalues. In this way, the PCA algorithm has the effect of taking a dataset and transforming 
it into a new, lower-dimensional coordinate system. 
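
The steps described in this section can be sketched directly in NumPy. This is an illustrative
implementation rather than the one scikit-learn uses internally; the function name pca and the
synthetic data are assumptions. Because a covariance matrix is symmetric, np.linalg.eigh already
returns orthonormal eigenvectors, so the orthonormalization step is implicit here.

```python
import numpy as np


def pca(data, n_components):
    """Project data onto its top principal components via the covariance matrix."""
    centered = data - data.mean(axis=0)              # 1. center the data
    cov = np.cov(centered, rowvar=False)             # 2. covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # 3-4. orthonormal eigenvectors
    order = np.argsort(eigenvalues)[::-1]            # largest eigenvalue first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    explained = eigenvalues / eigenvalues.sum()      # 5. proportion of variance
    projected = centered @ eigenvectors[:, :n_components]
    return projected, explained[:n_components]


rng = np.random.default_rng(0)
t = rng.normal(size=300)
# Synthetic 3-D data that mostly varies along a single direction.
data = np.column_stack([t,
                        2 * t + rng.normal(scale=0.1, size=300),
                        rng.normal(scale=0.1, size=300)])
projected, explained = pca(data, n_components=2)
print(projected.shape, np.round(explained, 3))
```

In this synthetic case nearly all of the variance falls on the first component, which is what the
eigenvalue ratio reports.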
Employing PCA


Now that we've reviewed the PCA algorithm at a high level, we're going to jump straight in and apply
PCA to a key Python dataset: the UCI handwritten digits dataset, distributed as part of scikit-learn.


This dataset is composed of 1,797 instances of handwritten digits gathered from 44 different writers.
The input (pressure and location) from these authors' writing is resampled twice across an 8 x 8 grid
so as to yield maps of the kind shown in the following image:

[Figure: example 8 x 8 maps of handwritten digits]

These maps can be transformed into feature vectors of length 64, which are then readily usable as
analysis input. With an input dataset of 64 features, there is an immediate appeal to using a technique
like PCA to reduce the set of variables to a manageable amount. As it currently stands, we cannot
effectively explore the dataset with exploratory visualization!


We will begin applying PCA to the handwritten digits dataset with the following code: 


import numpy as np
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
# In older scikit-learn releases this was: from sklearn.lda import LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
import matplotlib.cm as cm

digits = load_digits()
data = digits.data

n_samples, n_features = data.shape
n_digits = len(np.unique(digits.target))
labels = digits.target


This code does several things for us: 


1. First, it loads up a set of necessary libraries, including numpy, a set of components from
   scikit-learn (the digits dataset itself, PCA, and data scaling functions), and the plotting
   capabilities of matplotlib.
2. The code then begins preparing the digits dataset. It does several things in order:
   o First, it loads the dataset before creating helpful variables
   o The data variable is created for subsequent use, and the number of distinct digits in the
     target vector (0 through to 9, so n_digits = 10) is saved as a variable that we can
     easily access for subsequent analysis
   o The target vector is also saved as labels for later use
   o All of this variable creation is intended to simplify subsequent analysis
3. With the dataset ready, we can initialize our PCA algorithm and apply it to the dataset:


pca = PCA(n_components=10)
data_r = pca.fit(data).transform(data)

print('explained variance ratio (first ten components): %s' %
      str(pca.explained_variance_ratio_))
print('sum of explained variance (first ten components): %s' %
      str(sum(pca.explained_variance_ratio_)))


4. This code outputs the variance explained by each of the first ten principal components ordered 
by explanatory power. 


In the case of this set of 10 principal components, they collectively explain 0.589 of the overall
dataset variance. This isn't actually too bad, considering that it's a reduction from 64 variables to 10
components. It does, however, illustrate the potential lossiness of PCA. The key question, though, is
whether this reduced set of components makes subsequent analysis or classification easier to achieve;
that is, whether many of the remaining components contained variance that disrupts classification
attempts.


Having created a data_r object containing the output of pca performed over the digits dataset, let's 
visualize the output. To do so, we'll first create a vector of colors for class coloration. We then 
simply create a scatterplot with colorized classes: 


x = np.arange(10)
ys = [i + x + (i * x) ** 2 for i in range(10)]

plt.figure()
colors = cm.rainbow(np.linspace(0, 1, len(ys)))

# Plot each digit class in its own color, using the first two components
for c, i in zip(colors, range(10)):
    plt.scatter(data_r[labels == i, 0], data_r[labels == i, 1],
                color=c, alpha=0.4, label=str(i))
plt.legend()
plt.title('Scatterplot of Points plotted in first \n'
          '2 Principal Components')
plt.show()


The resulting scatterplot looks as follows:

[Figure: Scatterplot of Points plotted in first 2 Principal Components]


This plot shows us that, while there is some separation between classes in the first two principal
components, it may be tricky to classify highly accurately with this dataset. However, classes do
appear to be clustered and we may be able to get reasonably good results by employing a clustering
analysis. In this way, PCA has given us some insight into how the dataset is structured and has
informed our subsequent analysis.


At this point, let's take this insight and move on to examine clustering by the application of the
k-means clustering algorithm.

Introducing k-means clustering


In the previous section, you learned that unsupervised machine learning algorithms are used to extract 
key structural or information content from large, possibly complex datasets. These algorithms do so 
with little or no manual input and function without the need for training data (sets of labeled 
explanatory and response variables needed to train an algorithm in order to recognize the desired 
classification boundaries). This means that unsupervised algorithms are effective tools to generate 
information about the structure and content of new or unfamiliar datasets. They allow the analyst to 
build a strong understanding in a fraction of the time.

Clustering: a primer


Clustering is probably the archetypal unsupervised learning technique for several reasons.


A lot of development time has been sunk into optimizing clustering algorithms, with efficient 
implementations available in most data science languages including Python. 


Clustering algorithms tend to be very fast, with smoothed implementations running in polynomial 
time. This makes it uncomplicated to run multiple clustering configurations, even over large datasets. 
Scalable clustering implementations also exist that parallelize the algorithm to run over TB-scale 
datasets. 


Clustering algorithms are frequently easily understood and their operation is thus easy to explain if 
necessary. 


The most popular clustering algorithm is k-means; this algorithm forms k-many clusters by first
randomly initiating the clusters as k-many points in the data space. Each of these points is the mean of
a cluster. An iterative process then occurs, running as follows:


• Each point is assigned to a cluster based on the least (within cluster) sum of squares, which is
intuitively the nearest mean.

• The center (centroid) of each cluster becomes the new mean. This causes each of the means to
shift.


Over enough iterations, the centroids move into positions that minimize a performance metric (the 
performance metric most commonly used is the "within cluster least sum of squares" measure). Once 
this measure is minimized, observations are no longer reassigned during iteration; at this point the 
algorithm has converged on a solution. 
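
The iterative process described above can be sketched in a few lines of NumPy. This is a plain
version of the algorithm with random initial means, written for illustration; it is not the
scikit-learn implementation, and the function name k_means and the synthetic blobs are assumptions.

```python
import numpy as np


def k_means(data, k, n_iter=100, seed=0):
    """Plain k-means with randomly chosen data points as the initial means."""
    rng = np.random.default_rng(seed)
    means = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        distances = np.linalg.norm(data[:, None, :] - means[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each mean moves to the centroid of its cluster.
        new_means = np.array([data[labels == j].mean(axis=0)
                              if np.any(labels == j) else means[j]
                              for j in range(k)])
        if np.allclose(new_means, means):  # no movement, so we have converged
            break
        means = new_means
    return labels, means


rng = np.random.default_rng(1)
blob_a = rng.normal(loc=(0, 0), scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=(5, 5), scale=0.3, size=(50, 2))
labels, means = k_means(np.vstack([blob_a, blob_b]), k=2)
print(np.round(means, 1))
```

On two well-separated blobs the means settle near the blob centers within a few iterations, after
which no point changes cluster.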
Kick-starting clustering analysis


Now that we've reviewed the clustering algorithm, let's run through the code and see what clustering 
can do for us: 


from time import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale

np.random.seed()

digits = load_digits()
data = scale(digits.data)

n_samples, n_features = data.shape
n_digits = len(np.unique(digits.target))
labels = digits.target

sample_size = 300

print("n_digits: %d, \t n_samples %d, \t n_features %d"
      % (n_digits, n_samples, n_features))

print(79 * '_')
print('% 9s' % 'init'
      '    time  inertia    homo   compl  v-meas     ARI     AMI  silhouette')


def bench_k_means(estimator, name, data):
    # Fit the estimator, time it, and print one row of clustering metrics
    t0 = time()
    estimator.fit(data)
    print('% 9s   %.2fs    %i   %.3f   %.3f   %.3f   %.3f   %.3f    %.3f'
          % (name, (time() - t0), estimator.inertia_,
             metrics.homogeneity_score(labels, estimator.labels_),
             metrics.completeness_score(labels, estimator.labels_),
             metrics.v_measure_score(labels, estimator.labels_),
             metrics.adjusted_rand_score(labels, estimator.labels_),
             metrics.adjusted_mutual_info_score(labels, estimator.labels_),
             metrics.silhouette_score(data, estimator.labels_,
                                      metric='euclidean',
                                      sample_size=sample_size)))


Note 


One critical difference between this code and the PCA code we saw previously is that this code
begins by applying a scale function to the digits dataset. This function standardizes each feature in
the dataset to zero mean and unit variance. It's critically important to scale data wherever needed,
either on a log scale or a bounded scale, so as to prevent the magnitudes of different features from
having disproportionately powerful effects on the analysis. The key to determining whether the data
needs scaling at all (and what kind of scaling is needed, within which range, and so on) is very much
tied to the shape and nature of the data. If the distribution of the data shows outliers or variation
within a large range, it may be appropriate to apply log-scaling. Whether this is done manually
through visualization and exploratory analysis techniques or through the use of summary statistics,
decisions around scaling are tied to the data under inspection and the analysis techniques to be
used. A further discussion of scaling decisions and considerations may be found in Chapter 7,
Feature Engineering Part II.


Helpfully, scikit-learn uses the k-means++ algorithm by default, which improves over the original
k-means algorithm in terms of both running time and success rate in avoiding poor clusterings.


The algorithm achieves this through its initialization procedure: rather than placing every initial
centroid uniformly at random, it chooses them one at a time, favoring points that lie far from the
centroids already chosen, so that the starting centroids approximate minimal variance within classes.
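

As a rough sketch of the seeding idea behind k-means++: the first centroid is drawn uniformly at
random, and each subsequent one is drawn with probability proportional to its squared distance from
the nearest centroid chosen so far. The helper name kmeans_pp_init and the toy data are assumptions
made for illustration; scikit-learn's own routine differs in its details.

```python
import numpy as np


def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centroids with the k-means++ seeding rule (sketch)."""
    rng = np.random.default_rng(seed)
    # The first centroid is an observation chosen uniformly at random.
    centroids = [points[rng.integers(len(points))]]
    for _ in range(1, k):
        # Squared distance from each point to its nearest chosen centroid.
        diffs = points[:, None, :] - np.array(centroids)[None, :, :]
        nearest_sq = (diffs ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to that
        # squared distance, so far-away points are favoured.
        probs = nearest_sq / nearest_sq.sum()
        centroids.append(points[rng.choice(len(points), p=probs)])
    return np.array(centroids)


# Toy data (hypothetical): three groups of five identical points each.
groups = np.repeat(np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]]), 5, axis=0)
seeds = kmeans_pp_init(groups, k=3)
print(seeds)
```

Because points that coincide with an existing centroid have zero probability of being drawn, the
three seeds in this toy case always come from three different groups.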


You may have spotted from the preceding code that we're using a set of performance estimators to
track how well our k-means application is performing. It isn't practical to measure the performance of
a clustering algorithm based on a single correctness percentage or using the same performance
measures that are commonly used with other algorithms. The definition of success for clustering
algorithms is that they provide an interpretation of how input data is grouped that trades off between
several factors, including class separation, in-group similarity, and cross-group difference.


The homogeneity score is a simple, zero-to-one-bounded measure of the degree to which clusters
contain only assignments of a given class. A score of one indicates that all clusters contain
measurements from a single class. This measure is complemented by the completeness score, which
is a similarly bounded measure of the extent to which all members of a given class are assigned to the
same cluster. As such, a completeness score and homogeneity score of one indicates a perfect
clustering solution.


The validity measure (v-measure) is a harmonic mean of the homogeneity and completeness scores,
which is exactly analogous to the F-measure for binary classification. In essence, it provides a single,
0-1-scaled value to monitor both homogeneity and completeness.
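

To make the relationship between the three scores concrete, the following sketch computes them from
the standard entropy-based definitions, using only the standard library. The function names are
hypothetical, and the code is an illustration rather than scikit-learn's implementation.

```python
from collections import Counter
from math import log


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) for two equally long label sequences."""
    n = len(target)
    joint = Counter(zip(target, given))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (_, g), c in joint.items())


def v_measure(classes, clusters):
    """Return (homogeneity, completeness, v-measure)."""
    h_c, h_k = entropy(classes), entropy(clusters)
    # Homogeneity: how little class uncertainty remains once the cluster is known.
    homogeneity = 1.0 if h_c == 0 else 1.0 - conditional_entropy(classes, clusters) / h_c
    # Completeness: how little cluster uncertainty remains once the class is known.
    completeness = 1.0 if h_k == 0 else 1.0 - conditional_entropy(clusters, classes) / h_k
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    # The v-measure is the harmonic mean of the two.
    v = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v


print(v_measure([0, 0, 1, 1], [1, 1, 0, 0]))  # clusters match classes up to renaming
print(v_measure([0, 0, 1, 1], [0, 1, 2, 3]))  # every point in its own cluster
```

In the second call each cluster is pure, so homogeneity is 1, but every class is split across two
clusters, so completeness falls to 0.5 and the v-measure to 2/3.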


The Adjusted Rand Index (ARI) is a similarity measure that tracks the consensus between sets of
assignments. As applied to clustering, it measures the consensus between the true, pre-existing
observation labels and the labels predicted as an output of the clustering algorithm. The Rand index
measures labeling similarity on a 0-1 bound scale, with one equaling perfect prediction labels; the
adjusted version corrects this score for chance, so that random labelings score close to zero.
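

The unadjusted Rand index has a simple pair-counting definition, sketched below with the standard
library only: it is the fraction of observation pairs that the two labelings treat the same way
(both place the pair together, or both place it apart). The function name is hypothetical; the
adjusted variant additionally subtracts the agreement expected by chance.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two labelings agree."""
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        # The pair counts as agreement when both labelings group it the same way.
        agree += same_a == same_b
        total += 1
    return agree / total


print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # identical groupings, renamed
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # groupings that cut across each other
```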


The main challenge with all of the preceding performance measures as well as other similar measures
(for example, the Adjusted Mutual Information score, AMI) is that they require an understanding of the
ground truth, that is, they require some or all of the data under inspection to be labeled. If labels do
not exist and cannot be generated, these measures won't work. In practice, this is a pretty substantial
drawback as very few datasets come prelabeled and the creation of labels can be time-consuming.


One option to measure the performance of a k-means clustering solution without labeled data is the
Silhouette Coefficient. This is a measure of how well-defined the clusters within a model are. The
Silhouette Coefficient for a given dataset is the mean of the coefficient for each sample, where this
coefficient is calculated as follows:


$$s = \frac{b - a}{\max(a, b)}$$


The definitions of each term are as follows:


• a: The mean distance between a sample and all other points in the same cluster
• b: The mean distance between a sample and all other points in the next nearest cluster


This score is bounded between -1 and 1, with -1 indicating incorrect clustering, 1 indicating very
dense clustering, and scores around 0 indicating overlapping clusters. This tends to fit our
expectations of how a good clustering solution is composed.
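

The per-sample calculation can be sketched directly from these definitions. This is a minimal
illustration assuming Euclidean distance and at least two clusters; the function name and the
one-dimensional toy data are hypothetical, and singleton clusters are simply given a score of 0.

```python
import numpy as np


def silhouette_samples_sketch(points, labels):
    """Per-sample silhouette values using Euclidean distance."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise distance matrix between all samples.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = np.zeros(len(points))
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False  # a: exclude the sample itself
        if not same.any():
            continue  # singleton cluster: score left at 0
        a = dist[i, same].mean()
        # b: smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores


# Toy data (hypothetical): two tight, well-separated pairs on a line.
points = [[0.0], [1.0], [10.0], [11.0]]
labels = [0, 0, 1, 1]
scores = silhouette_samples_sketch(points, labels)
print(scores, scores.mean())
```

For the first sample, a = 1 and b = 10.5, giving s = 9.5 / 10.5; the dataset-level coefficient is the
mean of the four per-sample values.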


In the case of the digits dataset, we can employ all of the performance measures described here. As
such, we'll complete the preceding example by calling our bench_k_means function over the
digits dataset:


bench_k_means(KMeans(init='k-means++', n_clusters=n_digits, n_init=10),
              name="k-means++", data=data)
print(79 * '_')


This yields the following output (note that the random seed means your results will vary from mine!): 


n_digits: 10,    n_samples 1797,    n_features 64

init         time  inertia    homo   compl  v-meas     ARI     AMI  silhouette
k-means++   0.25s    69517   0.596   0.643   0.619   0.465   0.592    0.123





Let's take a look at these results in more detail.


The Silhouette score at 0.123 is fairly low, but not surprisingly so, given that the handwritten digits
data is inherently noisy and does tend to overlap. However, some of the other scores are not that
impressive. The V-measure at 0.619 is reasonable, but in this case is held back by a poor
homogeneity measure, suggesting that the cluster centroids did not resolve perfectly. Moreover, the
ARI at 0.465 is not great.


Note 


Let's put this in context. The worst case classification attempt, random assignment, would give at best
10% classification accuracy. All of our performance measures would be accordingly very low. While
we're definitely doing a lot better than that, we're still trailing far behind the best computational
classification attempts. As we'll see in Chapter 4, Convolutional Neural Networks, convolutional
nets achieve results with extremely low classification errors on handwritten digit datasets. We're
unlikely to achieve this level of accuracy with traditional k-means clustering!


All in all, it's reasonable to think that we could do better. 


To give this another try, we'll apply an additional stage of processing: PCA, the technique we
previously walked through, which condenses the variation in our input dataset into a small number of
components. The code to achieve this is very simple, as follows:


pca = PCA(n_components=n_digits).fit(data)
# The principal components serve as the initial centroids, so a single run (n_init=1) suffices
bench_k_means(KMeans(init=pca.components_, n_clusters=10, n_init=1),
              name="PCA-based",
              data=data)


This code fits PCA to the digits dataset, yielding as many principal components as there are classes
(in this case, digits), and passes those components to k-means as its initial cluster centroids. It can
be sensible to review the output of PCA before proceeding as the presence of any small principal
components may suggest a dataset that contains collinearity or otherwise merits further inspection.


This instance of clustering shows noticeable improvement: 


n_digits: 10,    n_samples 1797,    n_features 64

init         time  inertia    homo   compl  v-meas     ARI  silhouette
PCA-based   0.02s    71820   0.675   0.715   0.693   0.567    0.121





The V-measure and ARI have increased by approximately 0.08 points, with the V-measure reading a 
fairly respectable 0.693. The Silhouette Coefficient did not change significantly. Given the 
complexity and interclass overlap within the digits dataset, these are good results, particularly 
stemming from such a simple code addition! 


Inspection of the digits dataset with clusters superimposed shows that some meaningful clusters
appear to have been formed. It is also apparent from the following plot that actually detecting the
character from the input feature vectors may be a challenging task:
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K-means clustering on the digits dataset with K = 10 
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Tuning your clustering configurations 


The previous examples described how to apply k-means, walked through relevant code, showed how
to plot the results of a clustering analysis, and identified appropriate performance metrics. However,
when applying k-means to real-world datasets, there are some extra precautions that need to be taken,
which we will discuss.


A critical practical point is how to select an appropriate value for k. Initializing k-means
clustering with a specific k value may not be harmful, but in many cases it is not clear initially how
many clusters you might find or what values of k may be helpful.


We can rerun the preceding code for multiple values of k in a batch and look at the performance
metrics, but this won't tell us which instance of k is most effectively capturing structure within the
data. The risk is that as k increases, the Silhouette Coefficient or unexplained variance may decrease
dramatically, without meaningful clusters being formed. The extreme case of this would be if k = o,
where o is the number of observations in the sample; every point would have its own cluster, the
Silhouette Coefficient would be low, but the results wouldn't be meaningful. There are, however,
many less extreme cases in which overfitting may occur due to an overly high k value.


To mitigate this risk, it's advisable to use supporting techniques to motivate a selection of k. One
useful technique in this context is the elbow method. The elbow method is a very simple technique;
for each instance of k, plot the percentage of explained variance against k. This typically produces a
plot that looks like a bent arm.


For the PCA-reduced dataset, this code looks like the following snippet: 


import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

digits = load_digits()
data = scale(digits.data)

n_samples, n_features = data.shape
n_digits = len(np.unique(digits.target))
labels = digits.target

K = range(1,20)
explainedvariance = []
for k in K:
    # Project onto two principal components, then cluster the projection
    reduced_data = PCA(n_components=2).fit_transform(data)
    kmeans = KMeans(init = 'k-means++', n_clusters = k, n_init = k)
    kmeans.fit(reduced_data)
    # Mean distance from each point to its nearest cluster center
    explainedvariance.append(sum(np.min(cdist(reduced_data,
        kmeans.cluster_centers_, 'euclidean'), axis = 1))/data.shape[0])

plt.plot(K, explainedvariance, 'bx-')
plt.show()


This application of the elbow method takes the PCA reduction from the previous code sample and
applies a test of the explained variance (specifically, a test of the variance within clusters). The result
is output as a measure of unexplained variance for each value of k in the range specified. In this case,
as we're using the digits dataset (which we know to have ten classes), the range specified was 1 to
20:





The elbow method involves selecting the value of k that maximizes explained variance while
minimizing k; that is, the value of k at the crook of the elbow. The technical sense underlying this is
that a minimal gain in explained variance at greater values of k is offset by the increasing risk of
overfitting.
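

One simple way to locate the crook numerically, offered here as an assumption rather than as part of
the method described above, is to pick the point that lies farthest from the straight line joining
the first and last points of the curve. The helper name find_elbow and the distortion values below
are made up for illustration, and the result depends on how the two axes are scaled, so this remains
a heuristic.

```python
def find_elbow(ks, distortions):
    """Return the k whose point lies farthest from the chord joining the
    first and last points of the curve (a hypothetical helper)."""
    x0, y0 = ks[0], distortions[0]
    x1, y1 = ks[-1], distortions[-1]
    chord = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5

    def gap(x, y):
        # Perpendicular distance from (x, y) to the chord.
        return abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0) / chord

    return max(zip(ks, distortions), key=lambda p: gap(*p))[0]


# Invented unexplained-variance values that flatten out after k = 4.
ks = list(range(1, 9))
distortions = [100, 60, 35, 20, 17, 15, 14, 13]
print(find_elbow(ks, distortions))
```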


Elbow plots may be more or less pronounced and the elbow may not always be clearly identifiable.
This example shows a more gradual progression than may be observable in other cases with other
datasets. It's worth noting that, while we know the number of classes within the dataset to be ten, the
elbow method starts to show diminishing returns on k increases almost immediately and the elbow is
located at around five classes. This has a lot to do with the substantial overlap between classes,
which we saw in previous plots. While there are ten classes, it becomes increasingly difficult to
clearly identify more than five or so.


With this in mind, it's worth noting that the elbow method is intended for use as a heuristic rather than 
as some kind of objective principle. The use of PCA as a preprocess to improve clustering 
performance also tends to smooth the graph, delivering a more gradual curve than otherwise. 


In addition to making use of the elbow method, it can be valuable to look at the clusters themselves,
as we did earlier in the chapter, using PCA to reduce the dimensionality of the data. By plotting the
dataset and projecting cluster assignation onto the data, it is sometimes very obvious when a k-means
implementation has fitted to a local minimum or has overfit the data. The following plot demonstrates
extreme overfitting of our previous k-means clustering algorithm to the digits dataset, artificially
prompted by using K = 150. In this example, some clusters contain a single observation; there's really
no way that this output would generalize to other samples well:


K-means clustering on the digits dataset with K = 150 - demonstrative of overfitting





Plotting the elbow function or cluster assignments is quick to achieve and straightforward to interpret.
However, we've spoken of these techniques in terms of being heuristics. If a dataset contains a
deterministic number of classes, we may not be sure that a heuristic method will deliver
generalizable results.


Another drawback is that visual plot checking is a very manual technique, which makes it poorly
suited for production environments or automation. In such circumstances, it's ideal to find a
code-based, automatable method. One solid option in this case is v-fold cross-validation, a widely
used validation technique.


Cross-validation is simple to undertake. To make it work, one splits the dataset into v parts. One of
the parts is set aside individually as a test set. The model is trained against the training data, which is
all parts except the test set, and is then scored against the held-out part. Let's try this now, again
using the digits dataset:


import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.preprocessing import scale

digits = load_digits()
data = scale(digits.data)

n_samples, n_features = data.shape
n_digits = len(np.unique(digits.target))
labels = digits.target

kmeans = KMeans(init='k-means++', n_clusters=n_digits, n_init=n_digits)

# Older scikit-learn releases exposed these tools in sklearn.cross_validation,
# where ShuffleSplit took n_samples and n_iter in place of n_splits.
cv = ShuffleSplit(n_splits = 10, test_size = 0.4, random_state = 0)
scores = cross_val_score(kmeans, data, labels, cv = cv, scoring = 'adjusted_rand_score')
print(scores)
print(sum(scores)/cv.get_n_splits())


This code performs some now familiar data loading and preparation and initializes the k-means
clustering algorithm. It then defines cv, the cross-validation parameters. This includes specification
of the number of iterations, n_splits (named n_iter in older scikit-learn releases), and the amount of
data that should be used in each fold. In this case, we're using 60% of the data samples as training
data and 40% as test data.
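

To show what such a splitter does under the hood, here is a minimal standard-library sketch of
repeated random train/test splitting. The function name shuffle_split and its parameters are
hypothetical stand-ins chosen for illustration; this is not scikit-learn's implementation, which may
round the test-set size differently.

```python
import random


def shuffle_split(n_samples, n_splits=10, test_size=0.4, seed=0):
    """Yield (train_indices, test_indices) pairs from repeated random splits."""
    rng = random.Random(seed)
    n_test = int(round(n_samples * test_size))
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # The first n_test shuffled indices form the test set, the rest the training set.
        yield indices[n_test:], indices[:n_test]


splits = list(shuffle_split(n_samples=10, n_splits=3, test_size=0.4))
for train, test in splits:
    print(sorted(train), sorted(test))
```

Each split is drawn independently, so the same observation can appear in the test set of several
splits; this differs from strict v-fold partitioning, where every observation is held out exactly
once.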


We then apply the k-means model and cv parameters that we've specified within the cross-validation 
scoring function and print the results as scores. Let's take a look at these scores now: 


[ 0.39276606 0.49571292 0.43933243 0.53573558 0.42459285 
0.55686854 0.4573401 0.49876358 0.50281585 0.4689295 ] 


0.4772857426 


This output gives us, in order, the adjusted Rand score for cross-validated, k-means++ clustering
performed across each of the 10 folds. We can see that results do fluctuate between around
0.4 and 0.55; the earlier ARI score for k-means++ without PCA fell within this range (at 0.465).
What we've created, then, is code that we can incorporate into our analysis in order to check the
quality of our clustering automatically on an ongoing basis.


As noted earlier in this chapter, your choice of success measure is contingent on what information you
already have. In most cases, you won't have access to ground truth labels from a dataset and will be
obliged to use a measure such as the Silhouette Coefficient that we discussed previously.


Note 


Sometimes, even using both cross-validation and visualizations won't provide a conclusive result. 
Especially with unfamiliar datasets, it's not unheard of to run into issues where some noise or 
secondary signal resolves better at a different k value than the signal you're attempting to analyze. 


As with every other algorithm discussed in this book, it is imperative to understand the dataset one
wishes to work with. Without this insight, it's entirely possible for even a technically correct and
rigorous analysis to deliver inappropriate conclusions. Chapter 6, Text Feature Engineering will
discuss principles and techniques for the inspection and preparation of unfamiliar datasets more
thoroughly.
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Self-organizing maps 


An SOM is a technique to generate topological representations of data in reduced dimensions. It is one
of a number of techniques with such applications, with a better-known alternative being PCA.
However, SOMs present unique opportunities, both as dimensionality reduction techniques and as a
visualization format.




SOM — a primer 


The SOM algorithm involves iteration over many simple operations. When applied at a smaller scale,
it behaves similarly to k-means clustering (as we'll see shortly). At a larger scale, SOMs reveal the
topology of complex datasets in a powerful way.


An SOM is made up of a grid (commonly rectangular or hexagonal) of nodes, where each node
contains a weight vector that is of the same dimensionality as the input dataset. The nodes may be
initialized randomly, but an initialization that roughly approximates the distribution of the dataset will
tend to train faster.


The algorithm iterates as observations are presented as input. Iteration takes the following form: 


• Identifying the winning node in the current configuration: the Best Matching Unit (BMU). The
BMU is identified by measuring the Euclidean distance, in the data space, between the input vector
and each of the weight vectors; the node with the smallest distance wins.

• The BMU is adjusted (moved) towards the input vector.

• Neighboring nodes are also adjusted, usually by lesser amounts, with the magnitude of
neighboring movement being dictated by a neighborhood function. (Neighborhood functions vary.
In this chapter, we'll use a Gaussian neighborhood function.)


This process repeats over potentially many iterations, using sampling if appropriate, until the network 
converges (reaching a position where presenting a new input does not provide an opportunity to 
minimize loss). 
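

A single pass of this iteration can be sketched as follows. This is a minimal NumPy illustration
under stated assumptions, not the code from the book repository: the function name som_step, the
rectangular grid, the fixed sigma and learning rate, and the random toy weights are all hypothetical.

```python
import numpy as np


def som_step(weights, x, sigma, learning_rate):
    """One SOM update. weights has shape (rows, cols, n_features)."""
    rows, cols, _ = weights.shape
    # 1. Best Matching Unit: the node whose weight vector is closest to x.
    dist = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(dist.argmin(), dist.shape)
    # 2. Gaussian neighbourhood centred on the BMU, measured on the grid.
    grid_r, grid_c = np.indices((rows, cols))
    grid_sq = (grid_r - bmu[0]) ** 2 + (grid_c - bmu[1]) ** 2
    influence = np.exp(-grid_sq / (2 * sigma ** 2))
    # 3. Move every node towards x, scaled by its neighbourhood influence.
    weights += learning_rate * influence[:, :, None] * (x - weights)
    return bmu


rng = np.random.default_rng(0)
weights = rng.random((4, 4, 3))  # a 4 x 4 map over 3-dimensional inputs
x = np.array([0.9, 0.1, 0.5])
before = np.linalg.norm(weights - x, axis=2)
bmu = som_step(weights, x, sigma=1.0, learning_rate=0.5)
after = np.linalg.norm(weights - x, axis=2)
print(bmu, before[bmu], after[bmu])
```

With a learning rate of 0.5, the BMU moves exactly halfway towards the input, while nodes farther
away on the grid move by progressively smaller fractions.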


A node in an SOM is not unlike that of a neural network. It typically possesses a weight vector of
length equal to the dimensionality of the input dataset. This means that the topology of the input
dataset can be preserved and visualized through a lower-dimensional mapping.


The code for this SOM class implementation is available in the book repository in the som.py script.
For now, let's start working with the SOM algorithm in a familiar context.
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Employing SOM 


As discussed previously, the SOM algorithm is iterative, being based around Euclidean distance 
comparisons of vectors. 


This mapping tends to form a fairly readable 2D grid. In the case of the commonly-used Iris tutorial 
dataset, an SOM will map it out pretty cleanly: 





In this diagram, the classes have been separated and also ordered spatially. The background coloring
in this case is a clustering density measure. There is some minimal overlap between the blue and
green classes, where the SOM performed an imperfect separation. On the Iris dataset, an SOM will
tend to approach a converged solution on the order of 100 iterations, with little visible improvement
after 1,000. For more complex datasets containing less clearly divisible cases, this process can take
tens of thousands of iterations.


Awkwardly, there aren't implementations of the SOM algorithm within pre-existing Python packages 
like scikit-learn. This makes it necessary for us to use our own implementation. 


The SOM code we'll be working with for this purpose is located in the associated GitHub repository.
For now, let's take a look at the relevant script and get an understanding of how the code works:


import numpy as np
from sklearn.datasets import load_digits
from som import Som
from pylab import plot,axis,show,pcolor,colorbar,bone

digits = load_digits()
data = digits.data
labels = digits.target


At this point, we've loaded the digits dataset and identified labels as a separate set of data. Doing
this will enable us to observe how the SOM algorithm separates classes when assigning them to map:


som = Som(16, 16, 64, sigma=1.0, learning_rate=0.5)
som.random_weights_init(data)
print("Initiating SOM.")
som.train_random(data, 10000)
print("\n. SOM Processing Complete")

bone()
pcolor(som.distance_map().T)
colorbar()


At this point, we have utilized a Som class that is provided in a separate file, som.py, in the
repository. This class contains the methods required to deliver the SOM algorithm we discussed
earlier in the chapter. As arguments to its constructor, we provide the dimensions of the map and the
dimensionality of the input data. After trialing a range of options, we'll start out with 16 x 16 in this
case; this grid size gave the feature map enough space to spread out while retaining some overlap
between groups. The dimensionality argument determines the length of the weight vector within the
SOM's nodes. We also provide values for sigma and learning rate.


Sigma, in this case, defines the spread of the neighborhood function. As noted previously, we're using
a Gaussian neighborhood function. The appropriate value for sigma varies by grid size. For an 8 x 8
grid, we would typically want to use a value of 1.0 for sigma, while in this case we're using 1.3 for a
16 x 16 grid. It is fairly obvious when one's value for sigma is off; if the value is too small, values
tend to cluster near the center of the grid. If the value is too large, the grid typically ends up with
several large, empty spaces towards the center.


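The learning rate self-explanatorily defines the initial learning rate for the SOM. As the map
continues to iterate, the learning rate adjusts according to the following function:


$$\text{learning\_rate}(t) = \frac{\text{learning\_rate}}{1 + t / (0.5 \cdot T)}$$


Here, t is the iteration index; T is taken to be the total number of training iterations, an assumption
made because a rate that divides t by itself would never change.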
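

A small sketch makes the shape of this schedule easier to see. It assumes that the denominator
divides the iteration index by half of the total number of training iterations; the parameter
total_iterations and the function name are hypothetical, introduced only for this illustration.

```python
def decayed_learning_rate(initial_rate, t, total_iterations):
    """Learning rate at iteration t (total_iterations is an assumed parameter)."""
    return initial_rate / (1 + t / (0.5 * total_iterations))


# Rate at the start, the midpoint, and the end of a 10,000-iteration run.
schedule = [decayed_learning_rate(0.5, t, 10000) for t in (0, 5000, 10000)]
print(schedule)
```

Under this reading, the rate has halved by the midpoint of training and falls to one third of its
initial value by the final iteration.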


We follow up by first initializing our SOM with random weights.


Note


As with k-means clustering, this initialization method is slower than initializing based on an
approximation of the data distribution. A preprocessing step similar to that employed by the
k-means++ algorithm would accelerate the SOM's runtime. Our SOM runs sufficiently quickly over the
digits dataset to make this optimization unnecessary for now.


Next, we set up label and color assignations for each class, so that we can distinguish classes on the
plotted SOM. Following this, we iterate through each data point.


On each iteration, we plot a class-specific marker for the BMU as calculated by our SOM algorithm. 


When the SOM finishes iteration, we add a U-Matrix (a colorized matrix of relative observation 
density) as a monochrome-scaled plot layer: 


labels[labels == '0'] = 0
labels[labels == '1'] = 1
labels[labels == '2'] = 2
labels[labels == '3'] = 3
labels[labels == '4'] = 4
labels[labels == '5'] = 5
labels[labels == '6'] = 6
labels[labels == '7'] = 7
labels[labels == '8'] = 8
labels[labels == '9'] = 9


markers = ['o', 'v', '1', '3', '8', 's', 'p', 'x', 'D', '*']
colors = ["r", "g", "b", "y", "c", (0,0.1,0.8), (1,0.5,0), (1,1,0.3), "m",
          (0.4,0.6,0)]
# Plot a class-specific marker at the BMU of every observation
for cnt,xx in enumerate(data):
    w = som.winner(xx)
    plot(w[0]+.5,w[1]+.5,markers[labels[cnt]],
         markerfacecolor='None', markeredgecolor=colors[labels[cnt]],
         markersize=12, markeredgewidth=2)
axis([0,som.weights.shape[0],0,som.weights.shape[1]])
show()


This code generates a plot similar to the following: 





This code delivers a 16 x 16 node SOM plot. As we can see, the map has done a reasonably good job
of separating each cluster into topologically distinct areas of the map. Certain classes (particularly
the digits five in cyan circles and nine in green stars) have been located over multiple parts of the
SOM space. For the most part, though, each class occupies a distinct region and it's fair to say that the
SOM has been reasonably effective. The U-Matrix shows that regions with a high density of points
are co-habited by data from multiple classes. This isn't really a surprise as we saw similar results
with k-means and PCA plotting.
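

For readers curious how a U-Matrix can be derived from a trained map, the sketch below uses one
common definition: each cell holds the mean distance between a node's weight vector and those of its
immediate grid neighbors, so large values mark boundaries between regions. The function name and the
choice of four neighbors are assumptions; the distance_map method in som.py may normalize or define
neighborhoods differently.

```python
import numpy as np


def u_matrix(weights):
    """Mean distance from each node's weight vector to its grid neighbours."""
    rows, cols, _ = weights.shape
    out = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            # Up, down, left, and right neighbours that fall inside the grid.
            neighbours = [(r + dr, c + dc)
                          for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                          if 0 <= r + dr < rows and 0 <= c + dc < cols]
            out[r, c] = np.mean([np.linalg.norm(weights[r, c] - weights[n])
                                 for n in neighbours])
    return out


# Toy map (hypothetical): a 1 x 4 grid whose weights jump between the middle nodes.
toy_weights = np.array([[[0.0], [0.0], [10.0], [10.0]]])
print(u_matrix(toy_weights))
```

In the toy map, the two middle nodes sit on either side of the jump and receive the largest values,
which is how a boundary between two regions shows up in the plot.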
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Further reading 


Victor Powell and Lewis Lehe provide a fantastic interactive, visual explanation of PCA at
http://setosa.io/ev/principal-component-analysis/; this is ideal for readers who are new to the core
concepts of PCA or who are not quite getting it.


For a lengthier and more mathematically-involved treatment of PCA, touching on underlying matrix
transformations, Jonathon Shlens from Google Research provides a clear and thorough explanation at
http://arxiv.org/abs/1404.1100.


For a thorough worked example that translates Jonathon's description into clear Python code, consider
Sebastian Raschka's demonstration using the Iris dataset at
http://sebastianraschka.com/Articles/2015_pca_in_3_steps.html.


Finally, consider the sklearn documentation for more details on arguments to the PCA class.





For a lively and expert treatment of k-means, including detailed investigations of the conditions that
cause it to fail, and potential alternatives in such cases, consider David Robinson's fantastic blog,
Variance Explained, at http://varianceexplained.org/r/kmeans-free-lunch/.


A specific discussion of the Elbow method is provided by Rick Gove at
https://bl.ocks.org/rpgove/0060ff3b656618e9136b.


Finally, consider sklearn's documentation for another view on unsupervised learning algorithms,
including k-means at
http://scikit-learn.org/stable/tutorial/statistical_inference/unsupervised_learning.html.





Much of the existing material on Kohonen's SOM is either rather old, very high-level, or formally
expressed. A decent alternative to the description in this book is provided by John Bullinaria at
http://www.cs.bham.ac.uk/~jxb/NN/l16.pdf.


For readers interested in a deeper understanding of the underlying mathematics, I'd recommend
reading the work of Teuvo Kohonen directly. The 2012 edition of Self-Organising Maps is a great
place to start.


The concept of multicollinearity, referenced in the chapter, is given a clear explanation for the
unfamiliar at https://onlinecourses.science.psu.edu/stat501/node/344.
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Summary 


In this chapter, we've reviewed three techniques with a broad range of applications for preprocessing 
and dimensionality reduction. In doing so, you learned a lot about an unfamiliar dataset. 


We started out by applying PCA, a widely-utilized dimensionality reduction technique, to help us 
understand and visualize a high-dimensional dataset. We then followed up by clustering the data using 
k-means clustering, identifying means of improving and measuring our k-means analysis through 
performance metrics, the elbow method, and cross-validation. We found that k-means on the digits 
dataset, taken as is, didn't deliver exceptional results. This was due to class overlap that we spotted 
through PCA. We overcame this weakness by applying PCA as a preprocess to improve our 
subsequent clustering results. 


Finally, we developed an SOM algorithm that delivered a cleaner separation of the digit classes 
than PCA. 


Having learned some key basics around unsupervised learning techniques and analytical 
methodology, let's dive into the use of some more powerful unsupervised learning algorithms. 
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Chapter 2. Deep Belief Networks 


In the preceding chapter, we looked at some widely-used dimensionality reduction techniques, which 
enable a data scientist to get greater insight into the nature of datasets. 


The next few chapters will focus on some more sophisticated techniques, drawing from the area of 
deep learning. This chapter is dedicated to building an understanding of how to apply the Restricted 
Boltzmann Machine (RBM) and manage the deep learning architecture one can create by chaining 
RBMs—the deep belief network (DBN). DBNs are trainable to effectively solve complex problems 
in text, image, and sound recognition. They are used by leading companies for object recognition, 
intelligent image search, and robotic spatial recognition. 


The first thing that we're going to do is get a solid grounding in the algorithm underlying DBN; unlike
clustering or PCA, this algorithm isn't widely known by data scientists and we're going to review it in
some depth to build a strong working knowledge. Once we've worked through the theory, we'll build
upon it by stepping through code that brings the theory into focus and allows us to apply the technique
to real-world data. The diagnosis of these techniques is not trivial and needs to be rigorous, so we'll
emphasize the thought processes and diagnostic techniques that enable us to effectively watch and
control the success of our implementation.


By the end of this chapter, you'll understand how the RBM and DBN algorithms work, know how to 
use them, and feel confident in your ability to improve the quality of the results you get out of them. To 
summarize, the contents of this chapter are as follows: 


• Neural networks - a primer
• Restricted Boltzmann Machines
• Deep belief networks
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Neural networks — a primer 


The RBM is a form of recurrent neural network. In order to understand how the RBM works, it is
necessary to have a more general understanding of neural networks. Readers with an understanding of
artificial neural network (hereafter neural network, for the sake of simplicity) algorithms will find
familiar elements in the following description.


There are many accounts that cover neural networks in great theoretical detail; we won't go into great
detail retreading this ground. For the purposes of this chapter, we will first describe the components
of a neural network, common architectures, and prevalent learning processes.
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The composition of a neural network 


For unfamiliar readers, neural networks are a class of mathematical models that train to produce and 
optimize a definition for a function (or distribution) over a set of input features. The specific 
objective of a given neural network application can be defined by the operator using a performance 
measure (typically a cost function); in this way, neural networks may be used to classify, predict, or 
transform their inputs. 


The use of the word neural in neural networks is the product of a long tradition of drawing from
heavy-handed biological metaphors to inspire machine learning research. Hence, artificial neural
network algorithms originally drew (and frequently still draw) from biological neuronal structures.


A neural network is composed of the following elements: 


• A learning process: A neural network learns by adjusting parameters within the weight function 
of its nodes. This occurs by feeding the output of a performance measure (as described 
previously, in supervised learning contexts this is frequently a cost function, some measure of 
inaccuracy relative to the target output of the network) into the learning function of the network. 
This learning function outputs the weight adjustments required to minimize the cost function. 
(Technically, it typically calculates the partial derivatives, the terms required by gradient 
descent.) 

• A set of neurons or weights: Each contains a weight function (the activation function) that 
manipulates input data. The activation function may vary substantially between networks (with 
one well-known example being the hyperbolic tangent). The key requirement is that the weights 
must be adaptive, that is, adjustable based on updates from the learning process. In order to 
model non-parametrically (that is, to model effectively without defining details of the 
probability distribution), it is necessary to use both visible and hidden units. Hidden units are 
never observed. 

• Connectivity functions: They control which nodes can relay data to which other nodes. Nodes 
may be able to freely relay input to one another in an unrestricted or restricted fashion, or they 
may be more structured in layers through which input data must flow in a directed fashion. There 
is a broad range of interconnection patterns, with different patterns producing very different 
network properties and possibilities. 
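

As a rough illustration of the neuron element described above, the following sketch computes the output of a single unit (a weighted sum of its inputs passed through a hyperbolic tangent) and then of a small layer of such units. The function names, weights, and inputs are made up for illustration, and plain Python is used rather than any particular library:

```python
import math


def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs passed through a tanh activation."""
    pre_activation = sum(x * w for x, w in zip(inputs, weights)) + bias
    return math.tanh(pre_activation)


def layer_output(inputs, weight_rows, biases):
    """Apply one neuron per row of weights to the same input vector."""
    return [neuron_output(inputs, row, b)
            for row, b in zip(weight_rows, biases)]


# Two inputs feeding two neurons; all values are illustrative.
hidden = layer_output([1.0, 0.5], [[0.2, -0.4], [0.7, 0.1]], [0.0, -0.3])
```

Adjusting the entries of the weight rows and biases is what the learning process does; the connectivity function corresponds here to the choice of which inputs each neuron is allowed to see.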


Utilizing this set of elements enables us to build a broad range of neural networks, ranging from the 
familiar directed acyclic graph (with perhaps the best-known example being the Multi-Layer 
Perceptron (MLP)) to creative alternatives. The Self-Organizing Map (SOM) that we employed in 
the preceding chapter was a type of neural network, with a unique learning process. The algorithm 
that we'll examine later in this chapter, that of the RBM, is another neural network algorithm with 
some unique properties. 


Network topologies 


There are many variations on how the neurons in a neural network are connected, with structural 
decisions being an important factor in determining the network's learning capabilities. Common 
topologies in unsupervised learning tend to differ from those common to supervised learning. One 
common and now familiar unsupervised learning topology is that of the SOM that we discussed in the 
last chapter. 


The SOM, as we saw, directly projects individual input cases onto a weight vector contained by each 
node. It then proceeds to reorder these nodes until an appropriate mapping of the dataset is converged 
on. The actual structure of a SOM varies with the details of training, the specific outcome of 
a given instance of training, and the design decisions taken in structuring the network, but square or 
hexagonal grid structures are becoming increasingly common. 


A very common topology type in supervised learning is that of a three-layer, feedforward network, 
with the classical case being the MLP. In this network topology model, the neurons in the network are 
split into layers, with each layer communicating to the layer "beyond" it. The first layer contains 
inputs that are fed to a hidden layer. The hidden layer develops a representation of the data using 
weight activations (with the right activation function, for example, sigmoid or Gaussian, an MLP can act 
as a universal function approximator) and activation values are communicated to the output layer. The 
output layer typically delivers network results. This topology, therefore, looks as follows: 


[Figure: a three-layer feedforward network, with the hidden layer and the output layer labeled] 





Other network topologies deliver different capabilities. The topology of a Boltzmann Machine, for 
instance, differs from those described previously. The Boltzmann machine contains hidden and 
visible neurons, like those of a three-layer network, but all of these neurons are connected to one 
another in a fully connected, cyclic graph: 





This topology makes Boltzmann machines stochastic—probabilistic rather than deterministic—and 
able to develop in one of several ways given a sufficiently complex problem. The Boltzmann machine 
is also generative, which means that it is able to fully (probabilistically) model all of the input 
variables, rather than using the observed variables to specifically model the target variables. 


Which network topology is appropriate depends to a large extent on your specific challenge and the 
desired output. Each tends to be strong in certain areas. Furthermore, each of the topologies described 
here will be accompanied by a learning process that enables the network to iteratively converge on an 
(ideally optimal) solution. 


There is a broad range of learning processes, with specific processes and topologies being more or 
less compatible with one another. The purpose of a learning process is to enable the network to adjust 
its weights, iteratively, in such a way as to create an increasingly accurate representation of the input 
data. 


As with network topologies, there are a great many learning processes to consider. Some familiarity 
is assumed and a great many excellent resources on learning processes exist (some good examples 
are given at the end of this chapter). This section will focus on delivering a common characterization 
of learning processes, while later in the chapter, we'll look in greater detail at a specific example. 


As noted, the objective of learning in a neural network is to iteratively improve the distribution of 
weights across the model so that it approximates the function underlying input data with increasing 
accuracy. This process requires a performance measure. This may be a classification error measure, 
as is commonly used in supervised classification contexts (for example, with the backpropagation learning 
algorithm in MLP networks). In stochastic networks, it may be a probability maximization term (such 
as energy in energy-based networks). 


In either case, once there is a performance measure, the network is effectively attempting to 
optimize that measure (for example, by reducing an error or energy term) using an optimization method. 
In many cases, the optimization of the network is 
achieved using gradient descent. As far as the gradient descent method is concerned, the 
size of your performance measure value on a given training iteration is analogous to the slope of your 
gradient. Minimizing the performance measure is therefore a question of descending that gradient to 
the point at which the error measure is at its lowest for that set of weights. 


The size of the network's updates for the next iteration (the learning rate of your algorithm) may be 
influenced by the magnitude of your performance measure, or it may be hard-coded. 


The weight updates by which your network adjusts may be derived from the error surface itself; if so, 
your network will typically have a means of calculating the gradient, that is, deriving the values to 
which updates need to adjust the parameters on your network's activated weight functions so as to 
continue to reduce the performance measure. 
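

To make the update rule concrete, here is a minimal sketch of gradient descent on a one-parameter cost. The cost function (w - 3)^2, the hard-coded learning rate, and all names are assumptions chosen for illustration; a real network applies the same step to every weight using the partial derivatives of its own performance measure:

```python
def gradient_descent(gradient, start, learning_rate=0.1, n_iterations=100):
    """Repeatedly step against the gradient, starting from `start`."""
    w = start
    for _ in range(n_iterations):
        # The size of each update scales with the slope at the current point.
        w = w - learning_rate * gradient(w)
    return w


def cost_gradient(w):
    """Derivative of the illustrative cost (w - 3) ** 2 with respect to w."""
    return 2.0 * (w - 3.0)


w_final = gradient_descent(cost_gradient, start=0.0)
```

With this cost each iteration shrinks the distance to the minimum at w = 3 by a constant factor, so `w_final` ends up very close to 3.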


Having reviewed the general concepts underlying network topologies and learning methods, let's 
move into the discussion of a specific neural network, the RBM. As we'll see, the RBM is a key part 
of a powerful deep learning algorithm. 


Restricted Boltzmann Machine 


The RBM is a fundamental part of the deep learning architecture that is this chapter's subject, the DBN. The 
following sections will begin by introducing the theory behind an RBM, including the architectural 
structure and learning processes. 


Following that, we'll dive straight into the code for an RBM class, making links between the 
theoretical elements and functions in code. We'll finish by touching on the applications of RBMs and 
the practical factors associated with implementing an RBM. 


Introducing the RBM 


A Boltzmann machine is a particular type of stochastic, recurrent neural network. It is an energy-based 
model, which means that it uses an energy function to associate an energy value with each 
configuration of the network. 


We briefly discussed the structure of a Boltzmann machine in the previous section. As mentioned, a 
Boltzmann machine is a cyclic graph in which every node is connected to all other nodes by symmetric 
connections. This 
property enables it to model in a recurrent fashion, such that the model's outputs evolve and can be 
viewed over time. 


The learning loop in a Boltzmann machine involves maximizing the probability of the training dataset, 
X. As noted, the specific performance measure used is energy, which is characterized as the negative 
log of the probability for a dataset X, given a vector of model parameters, Θ. This measure is 
calculated and used to update the network's weights in such a way as to minimize the free energy in 
the network. 


The Boltzmann machine has seen particular success in processing image data, including photographs, 
facial features, and handwriting classification contexts. 


Unfortunately, the Boltzmann machine is not practical for more challenging ML problems. This is due 
to the fact that there are challenges with the machine's ability to scale; as the number of nodes 
increases, the compute time grows exponentially, eventually leaving us in a position where we're 
unable to compute the free energy of the network. 


Note 


For those with an interest in the underlying formal reasoning, this happens because the probability of 
a data point, x, p(x; Θ), must integrate to 1 over all x. Achieving this requires that we use a partition 
function, Z, as a normalizing constant. (Z is a constant such that dividing a non-negative 
function by Z makes that function integrate to 1 over all inputs; in this case, over all x.) 


The probability model function is a function of a set of normal distributions. In order to get the energy 
for our model, we need to differentiate for each of the model's parameters; however, this becomes 
complicated because of the partition function. Each model parameter produces equations dependent 
on other model parameters and we ultimately find ourselves unable to calculate the energy without 
(potentially) hugely expensive calculations, whose cost increases as the network scales. 
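

The scaling problem can be seen in a small sketch. Assuming binary units (so that the integral becomes a sum) and the usual pairwise energy E(s) = -sum over i < j of w_ij s_i s_j - sum over i of b_i s_i, computing Z exactly means visiting every one of the 2^n states. The weights, biases, and function names below are made up for illustration:

```python
import itertools
import math


def energy(state, weights, biases):
    """E(s) = -sum_{i<j} w_ij s_i s_j - sum_i b_i s_i for binary units."""
    n = len(state)
    pairwise = sum(weights[i][j] * state[i] * state[j]
                   for i in range(n) for j in range(i + 1, n))
    bias_term = sum(b * s for b, s in zip(biases, state))
    return -pairwise - bias_term


def partition_function(weights, biases):
    """Sum exp(-E(s)) over all 2 ** n binary states."""
    n = len(biases)
    return sum(math.exp(-energy(s, weights, biases))
               for s in itertools.product([0, 1], repeat=n))


def probability(state, weights, biases):
    """p(s) = exp(-E(s)) / Z."""
    z = partition_function(weights, biases)
    return math.exp(-energy(state, weights, biases)) / z


# A three-node network with symmetric, illustrative weights.
weights = [[0.0, 0.5, -0.3],
           [0.5, 0.0, 0.8],
           [-0.3, 0.8, 0.0]]
biases = [0.1, -0.2, 0.0]
total = sum(probability(s, weights, biases)
            for s in itertools.product([0, 1], repeat=3))
```

Dividing by Z is what makes the probabilities sum to 1. With three nodes the loop visits 8 states; every additional node doubles that count, which is why exact evaluation stops being viable as the network grows.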


In order to overcome the weaknesses of the Boltzmann machine, it is necessary to make adjustments 
to both the network topology and training process. 


Topology 


The main topological change that delivers efficiency improvements is the restriction of connectivity 
between nodes. First, one must prevent connection between nodes within the same layer. 
Additionally, all skip-layer connections (that is, direct connections between non-consecutive layers) 
must be prevented. A Boltzmann machine with this architecture is referred to as an RBM and appears 
as shown in the following diagram: 





One advantage of this topology is that the units within each layer are conditionally independent 
of one another given the other layer. As such, it is possible to sample from one layer using the 
activations of the other. 
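

The following sketch shows what this independence buys us, assuming binary units and a sigmoid activation (as in the RBM class reviewed later in this chapter). Because no hidden unit depends on another hidden unit, each one can be sampled on its own from a probability computed from the visible layer alone. The weights and helper names are illustrative:

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probabilities(v, W, hbias):
    """p(h_j = 1 | v) for every hidden unit; each depends only on v."""
    return [sigmoid(hbias[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(hbias))]


def sample_layer(probabilities, rng):
    """Draw each binary unit independently from its own probability."""
    return [1 if rng.random() < p else 0 for p in probabilities]


rng = random.Random(0)
W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]  # 3 visible x 2 hidden, made up
hbias = [0.0, 0.0]
p_h = hidden_probabilities([1, 0, 1], W, hbias)
h = sample_layer(p_h, rng)
```

The same two functions, applied with the transposed weights and the visible biases, would sample the visible layer given the hidden layer.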


Training 


We observed previously that, for Boltzmann machines, the training time of the machine scales 
extremely poorly as the machine is scaled up to additional nodes, putting us in a position where we 
cannot evaluate the energy function that we're attempting to use in training. 


The RBM is typically trained using a procedure with a different learning algorithm at its heart, the 
Persistent Contrastive Divergence (PCD) algorithm, which provides an approximation of 
maximum likelihood. PCD doesn't evaluate the energy function itself, but instead allows us to 
estimate the gradient of the energy function. With this information, we can proceed by making very 
small adjustments in the direction of steepest descent, progressing, as desired, 
toward a local minimum. 


The PCD algorithm is made up of two phases. These are referred to as the positive and negative 
phases, and each phase has a corresponding effect on the energy of the model. The positive phase 
increases the probability of the training dataset, X, thus reducing the energy of the model. Following 
this, the negative phase draws samples from the model to estimate the negative phase 
gradient. The overall effect of the negative phase is to decrease the probability of samples generated 
by the model. 


Sampling in the negative phase and throughout the update process is achieved using a form of 
sampling called Gibbs sampling. 


Note 


Gibbs sampling is a member of the Markov Chain Monte Carlo (MCMC) family of algorithms, and 
samples from an approximated multivariate probability distribution. What this means is, rather than 
using a summed calculation in building our probabilistic model (just as we might do, for instance, 
when we flip a coin a certain number of times; in such cases, we may sum the number of heads 
as a proportion of the sum of all attempts), we approximate the value of an integral instead. 
The subject of how to create a probabilistic model by approximating an integral deserves more time 
than this book can give it. As such, the Further reading section of this chapter provides an excellent 
paper reference. The key points to bear in mind for now (and stripping out a lot of important detail!) 
are that, instead of summing each case exactly once, we sample based on the (often non-uniform) 
distribution of the data in question. Gibbs sampling is a probabilistic sampling method for each 
parameter in a model, based on all of the other parameter values in that model. As soon as a new 
parameter value is obtained, it is immediately used in sampling calculations for other parameters. 
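

A toy example may help here. The sketch below runs Gibbs sampling over two binary variables whose joint distribution is a made-up table; it is not the RBM itself, only the sampling idea in isolation. Each variable is resampled from its conditional distribution given the current value of the other, and the new value is used immediately:

```python
import random

# Hypothetical joint distribution over two binary variables (a, b).
JOINT = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}


def conditional_a(b):
    """p(a = 1 | b), computed from the joint table."""
    return JOINT[(1, b)] / (JOINT[(0, b)] + JOINT[(1, b)])


def conditional_b(a):
    """p(b = 1 | a), computed from the joint table."""
    return JOINT[(a, 1)] / (JOINT[(a, 0)] + JOINT[(a, 1)])


def gibbs_sample(n_samples, burn_in=500, seed=0):
    """Alternate between the two conditionals, keeping post-burn-in states."""
    rng = random.Random(seed)
    a, b = 0, 0
    samples = []
    for step in range(burn_in + n_samples):
        # Each new value is used immediately when sampling the other variable.
        a = 1 if rng.random() < conditional_a(b) else 0
        b = 1 if rng.random() < conditional_b(a) else 0
        if step >= burn_in:
            samples.append((a, b))
    return samples


samples = gibbs_sample(20000)
estimate_a = sum(a for a, _ in samples) / float(len(samples))
```

The table gives p(a = 1) = 0.3 + 0.4 = 0.7, and the fraction of samples with a = 1 should land close to that value without the joint distribution ever being summed over directly.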


Some of you may be asking at this point why PCD is necessary. Why not use a more familiar method, 
such as gradient descent with line search? To put it simply, we cannot easily calculate the free energy 
of our network as this calculation involves an integration across all the network's nodes. We 
recognized this limitation when we called out the big weakness of the Boltzmann machine: the 
compute time grows exponentially as the number of nodes increases, leaving us in a situation where 
we're trying to minimize a function whose value we cannot calculate! 


What PCD provides is a way to estimate the gradient of the energy function. This enables an 
approximation of the network's free energy, which is fast enough to be viable for application and has 
been shown to be generally accurate. (Refer to the Further reading section for a performance 
comparison.) 


As we saw previously, the RBM's probability model function is the joint distribution of our model 
parameters, making Gibbs sampling appropriate! 


The training loop in an initialized RBM involves several steps: 


1. We obtain the current iteration's activated hidden layer weight values. 

2. We perform the positive phase of PCD, using the state of the Gibbs chain from the previous 
iteration as input. 

3. We perform the negative phase of PCD using the pre-existing state of the Gibbs chain. This gives 
us the free energy value. 

4. We update the activated weights on the hidden layer using the energy value we've calculated. 


This algorithm allows the RBM to iteratively step toward a decreased free energy value. The RBM 
continues to train until both the probability of the training dataset integrates to one and free energy is 
equal to zero, at which point the RBM has converged. 
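

The loop can be sketched in plain Python on a toy problem. This is a simplified, assumption-laden illustration rather than the Theano implementation reviewed in this chapter: it uses PCD with a single Gibbs step, updates on one training vector at a time, and writes the gradient directly as the difference between data-driven and model-driven statistics instead of differentiating the free energy. The data, sizes, and learning rate are made up:

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample(probs, rng):
    return [1.0 if rng.random() < p else 0.0 for p in probs]


def p_hidden(v, W, hbias):
    return [sigmoid(hbias[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(hbias))]


def p_visible(h, W, vbias):
    return [sigmoid(vbias[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(vbias))]


def pcd_step(v_data, chain_h, W, vbias, hbias, lr, rng):
    """One PCD-1 update for one training vector; returns the new chain state."""
    # Positive phase: hidden probabilities driven by the data.
    ph_data = p_hidden(v_data, W, hbias)
    # Negative phase: one Gibbs step continuing the persistent chain.
    v_model = sample(p_visible(chain_h, W, vbias), rng)
    ph_model = p_hidden(v_model, W, hbias)
    # Raise the probability of the data, lower that of the model's sample.
    for i in range(len(vbias)):
        for j in range(len(hbias)):
            W[i][j] += lr * (v_data[i] * ph_data[j] - v_model[i] * ph_model[j])
        vbias[i] += lr * (v_data[i] - v_model[i])
    for j in range(len(hbias)):
        hbias[j] += lr * (ph_data[j] - ph_model[j])
    return sample(ph_model, rng)


rng = random.Random(0)
data = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]  # toy patterns
n_visible, n_hidden = 4, 2
W = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
     for _ in range(n_visible)]
vbias = [0.0] * n_visible
hbias = [0.0] * n_hidden
chain_h = [0.0] * n_hidden  # persistent chain state carried between updates
for epoch in range(200):
    for v in data:
        chain_h = pcd_step(v, chain_h, W, vbias, hbias, 0.1, rng)
```

The point to notice is that `chain_h` is never reset to the data: the negative phase always resumes from wherever the chain stopped on the previous update, which is what makes the procedure persistent.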


Now that we've had a chance to review the RBM topology and training process, let's apply the 
algorithm to classify a substantial real dataset. 


Applications of the RBM 


Now that we have a general working knowledge of the RBM algorithm, let's walk through code to 
create an RBM. We'll be working with an RBM class that will allow us to classify the MNIST 
handwritten digits dataset. The code we're about to review does the following: 


• It sets up the initial parameters of an RBM, including layer size, shareable bias vectors, and 
shareable weight matrix for connectivity with external network structures (this enables deep 
belief networks) 

• It defines functions for communication and inference between hidden and visible layers 

• It defines functions that allow us to update the parameters of network nodes 

• It defines functions that handle efficient sampling for the learning process, using PCD-k to 
accelerate sampling (making it possible to compute in a reasonable frame of time) 

• It defines functions that compute the free energy of the model (used to calculate the gradient 
required for PCD-k updates) 

• It identifies the Pseudo-Likelihood (PL), usable as a log-likelihood proxy to guide the selection 
of appropriate hyperparameters 


Let's begin examining our RBM class: 


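import numpy
import theano
import theano.tensor as T
from theano.tensor.shared_randomstreams import RandomStreams


class RBM(object):
    def __init__(
        self,
        input=None,
        n_visible=784,
        n_hidden=500,
        W=None,
        hbias=None,
        vbias=None,
        numpy_rng=None,
        theano_rng=None
    ):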


The first element that we need to build is an RBM constructor, which we can use to define the 
parameters of the model, such as the number of visible and hidden nodes (n_visible and n_hidden) 
as well as additional parameters that can be used to adjust how the RBM's inference functions and 
CD updates are performed. 


The W parameter can be used as a pointer to a shared weight matrix. This becomes more relevant 
when implementing a DBN, as we'll see later in the chapter; in such architectures, the weight matrix 
needs to be shared between different parts of the network. 


The hbias and vbias parameters are used similarly as optional references to shared hidden and 
visible (respectively) units’ bias vectors. Again, these are used in DBNs. 


The input parameter enables the RBM to be connected, top-to-tail, to other graph elements. This 
allows one to, for instance, chain RBMs. 


Having set up this constructor, we next need to flesh out each of the preceding parameters: 


        self.n_visible = n_visible
        self.n_hidden = n_hidden

        if numpy_rng is None:
            numpy_rng = numpy.random.RandomState(1234)

        if theano_rng is None:
            theano_rng = RandomStreams(numpy_rng.randint(2 ** 30))


This is fairly straightforward stuff; we set the visible and hidden nodes for our RBM and set up two 
random number generators. The theano_rng parameter will be used later in our code to sample from 
the RBM's hidden units: 


        if W is None:
            initial_W = numpy.asarray(
                numpy_rng.uniform(
                    low=-4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                    high=4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                    size=(n_visible, n_hidden)
                ),
                dtype=theano.config.floatX
            )


This code converts W to the floatX data type so that it can be run on the GPU. Next, we set up shared 
variables using theano.shared, which allows a variable's storage to be shared between functions 
that it appears in. Within the current example, the shared variables that we create will be the weight 
matrix (W) and bias variables for hidden and visible units (hbias and vbias, respectively). When we 
move on to creating deep networks with multiple components, the following code will allow us to 
share components between parts of our networks: 


            W = theano.shared(value=initial_W, name='W', borrow=True)

        if hbias is None:
            hbias = theano.shared(
                value=numpy.zeros(
                    n_hidden,
                    dtype=theano.config.floatX
                ),
                name='hbias',
                borrow=True
            )

        if vbias is None:
            vbias = theano.shared(
                value=numpy.zeros(
                    n_visible,
                    dtype=theano.config.floatX
                ),
                name='vbias',
                borrow=True
            )


At this point, we're ready to initialize the input layer as follows: 


        self.input = input
        # Fall back to a fresh symbolic matrix when no input is supplied.
        if input is None:
            self.input = T.matrix('input')

        self.W = W
        self.hbias = hbias
        self.vbias = vbias
        self.theano_rng = theano_rng
        self.params = [self.W, self.hbias, self.vbias]


As we now have an initialized input layer, our next task is to create the symbolic graph that we 
described earlier in the chapter. Achieving this is a matter of creating functions to manage the 
interlayer propagation and activation computation operations of the network: 


    def propup(self, vis):
        # Visible -> hidden: pre-activation and sigmoid activation.
        pre_sigmoid_activation = T.dot(vis, self.W) + self.hbias
        return [pre_sigmoid_activation, T.nnet.sigmoid(pre_sigmoid_activation)]


    def propdown(self, hid):
        # Hidden -> visible: uses the transposed weight matrix.
        pre_sigmoid_activation = T.dot(hid, self.W.T) + self.vbias
        return [pre_sigmoid_activation, T.nnet.sigmoid(pre_sigmoid_activation)]


These two functions pass the activation of one layer's units to the other layer. The first function passes 
the visible units' activation upward to the hidden units so that the hidden units can compute their 
activation conditional on a sample of the visible units. The second function does the reverse— 
propagating the hidden layer's activation downward to the visible units. 


It's probably worth asking why we're creating both propup and propdown. As we reviewed it, PCD 
only requires that we perform sampling from the hidden units. So what's the value of propdown? 


In a nutshell, sampling from the visible layer becomes useful when we want to sample from the RBM 
to review its progress. In most applications where our RBM is processing visual data, it is 
immediately valuable to periodically take the output of sampling from the visible layer and plot it, as 
shown in the following example: 


As we can see here, over the course of iteration, our network begins to change its labeling; in the first 
case, 7 morphs into 9, while elsewhere 9 becomes 6 and the network gradually reaches a definition of 
3-ness. 


As we discussed earlier, it's helpful to have as many views on the operation of your RBM as possible 
to ensure that it's delivering meaningful results. Sampling from the outputs it generates is one way to 
improve this visibility. 


Armed with information about the visible layer's activation, we can deliver a sample of the unit 
activations from the hidden layer, given the activation of the visible nodes: 


    def sample_h_given_v(self, v0_sample):
        pre_sigmoid_h1, h1_mean = self.propup(v0_sample)
        h1_sample = self.theano_rng.binomial(size=h1_mean.shape,
                                             n=1, p=h1_mean,
                                             dtype=theano.config.floatX)
        return [pre_sigmoid_h1, h1_mean, h1_sample]


Likewise, we can now sample from the visible layer given hidden unit activation information: 


    def sample_v_given_h(self, h0_sample):
        pre_sigmoid_v1, v1_mean = self.propdown(h0_sample)
        v1_sample = self.theano_rng.binomial(size=v1_mean.shape,
                                             n=1, p=v1_mean,
                                             dtype=theano.config.floatX)
        return [pre_sigmoid_v1, v1_mean, v1_sample]


We've now achieved the connectivity and update loop required to perform a Gibbs sampling step, as 
described earlier in this chapter. Next, we should define this sampling step! 


    def gibbs_hvh(self, h0_sample):
        # One Gibbs step starting from the hidden state: h -> v -> h.
        pre_sigmoid_v1, v1_mean, v1_sample = \
            self.sample_v_given_h(h0_sample)
        pre_sigmoid_h1, h1_mean, h1_sample = \
            self.sample_h_given_v(v1_sample)
        return [pre_sigmoid_v1, v1_mean, v1_sample,
                pre_sigmoid_h1, h1_mean, h1_sample]


As discussed, we need a similar function to sample from the visible layer: 


    def gibbs_vhv(self, v0_sample):
        # One Gibbs step starting from the visible state: v -> h -> v.
        pre_sigmoid_h1, h1_mean, h1_sample = \
            self.sample_h_given_v(v0_sample)
        pre_sigmoid_v1, v1_mean, v1_sample = \
            self.sample_v_given_h(h1_sample)
        return [pre_sigmoid_h1, h1_mean, h1_sample,
                pre_sigmoid_v1, v1_mean, v1_sample]


The code that we've written so far gives us some of our model. It sets up the nodes and layers and 
connections between layers. We've written the code that we need in order to update the network 
based on Gibbs sampling from the hidden layer. 


What we're still missing is code that allows us to perform the following: 


• Compute the free energy of the model. As we discussed, the model uses energy as the term to do 
the following: 
  ◦ Implement PCD using our Gibbs sampling step code, and setting the Gibbs step count 
  parameter, k = 1, to compute the parameter gradient for gradient descent 
  ◦ Create a means to feed the output of PCD (the computed gradient) to our previously defined 
  network update code 
• Develop the means to track the progress and success of our RBM throughout the training. 


First off, we'll create the means to calculate the free energy of our RBM. Note that this is the negative 
log of the probability distribution for the hidden layer, which we discussed earlier: 


    def free_energy(self, v_sample):
        wx_b = T.dot(v_sample, self.W) + self.hbias
        vbias_term = T.dot(v_sample, self.vbias)
        hidden_term = T.sum(T.log(1 + T.exp(wx_b)), axis=1)
        return -hidden_term - vbias_term
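

As a sanity check on this formula, the sketch below re-implements the free energy in plain Python for a single binary visible vector and compares it with a brute-force computation that sums exp(-E(v, h)) over every hidden configuration. It assumes the standard RBM energy E(v, h) = -vbias.v - hbias.h - sum over i, j of v_i W_ij h_j; the parameter values and function names are made up:

```python
import itertools
import math


def free_energy(v, W, vbias, hbias):
    """F(v) = -vbias.v - sum_j log(1 + exp(hbias_j + sum_i v_i W_ij))."""
    vbias_term = sum(b * vi for b, vi in zip(vbias, v))
    hidden_term = 0.0
    for j in range(len(hbias)):
        wx_b = hbias[j] + sum(v[i] * W[i][j] for i in range(len(v)))
        hidden_term += math.log(1.0 + math.exp(wx_b))
    return -hidden_term - vbias_term


def free_energy_by_enumeration(v, W, vbias, hbias):
    """-log sum_h exp(-E(v, h)), summing over every hidden configuration."""
    total = 0.0
    for h in itertools.product([0, 1], repeat=len(hbias)):
        energy = -sum(b * vi for b, vi in zip(vbias, v))
        energy -= sum(c * hj for c, hj in zip(hbias, h))
        energy -= sum(v[i] * W[i][j] * h[j]
                      for i in range(len(v)) for j in range(len(h)))
        total += math.exp(-energy)
    return -math.log(total)


W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]  # 3 visible x 2 hidden
vbias = [0.0, 0.1, -0.1]
hbias = [0.2, -0.3]
v = [1, 0, 1]
closed_form = free_energy(v, W, vbias, hbias)
enumerated = free_energy_by_enumeration(v, W, vbias, hbias)
```

The two values agree up to floating-point error. The closed form costs one term per hidden unit, whereas the enumeration doubles in cost with each hidden unit added, which is the practical benefit of the restricted topology.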


Next, we'll implement PCD. At this point, we'll be setting a couple of interesting parameters. The lr, 
short for learning rate, is an adjustable parameter used to adjust learning speed. The k parameter 
points to the number of steps to be performed by PCD (remember the PCD-k notation from earlier in 
the chapter). 


We discussed the PCD as containing two phases, positive and negative. The following code computes 
the positive phase of PCD: 


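    def get_cost_updates(self, lr=0.1, persistent=None, k=1):
        # Positive phase: sample the hidden units given the training input.
        pre_sigmoid_ph, ph_mean, ph_sample = \
            self.sample_h_given_v(self.input)

        # For PCD, `persistent` must be the shared variable that holds the
        # state of the Gibbs chain from the previous update.
        chain_start = persistent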


Meanwhile, the following code implements the negative phase of PCD. To do so, we scan the 
gibbs_hvh function k times, using Theano's scan operation, performing one Gibbs sampling step with 
each scan. After completing the negative phase, we acquire the free energy value: 


        (
            [
                pre_sigmoid_nvs,
                nv_means,
                nv_samples,
                pre_sigmoid_nhs,
                nh_means,
                nh_samples
            ],
            updates
        ) = theano.scan(
            self.gibbs_hvh,
            outputs_info=[None, None, None, None, None, chain_start],
            n_steps=k
        )
        # The visible sample at the end of the chain.
        chain_end = nv_samples[-1]

        cost = T.mean(self.free_energy(self.input)) - T.mean(
            self.free_energy(chain_end))

        # Do not backpropagate through the Gibbs chain itself.
        gparams = T.grad(cost, self.params,
                         consider_constant=[chain_end])


Having written code that performs the full PCD process, we need a way to feed the outputs to our 
network. At this point, we're able to connect our PCD learning process to the code to update the 
network that we reviewed earlier. The preceding updates dictionary is returned by theano.scan of the 
gibbs_hvh function. As you may recall, gibbs_hvh currently contains rules for the random states of 
theano_rng. What we need to do now is add the new parameter values and the variable containing the 
state of the Gibbs chain to the dictionary (the updates variable): 


        for gparam, param in zip(gparams, self.params):
            updates[param] = param - gparam * T.cast(
                lr,
                dtype=theano.config.floatX
            )

        # Store the end of the chain so that the next update resumes from it.
        updates[persistent] = nh_samples[-1]
        monitoring_cost = self.get_pseudo_likelihood_cost(updates)

        return monitoring_cost, updates


We now have almost all the parts that we need to make our RBM work. What's clearly missing is a 
means to inspect training, either during or after completion, to ensure that our RBM is learning an 
appropriate representation of the data. 


We talked previously about how to train an RBM, specifically about challenges posed by the partition 
function. Furthermore, earlier in the code, we implemented one means by which we can inspect an 
RBM during training; we created the gibbs_vhv function to perform Gibbs sampling from the model. 


In our previous discussion around how to validate an RBM, we discussed visually plotting the filters 
that the RBM has created. We'll review how this can be achieved shortly. 


The final possibility is to use the log of the PL as a more tractable proxy to the likelihood 
itself. Technically, the log-PL is the sum of the log-probabilities of each bit of a data point (each x_i) 
conditioned on all of the other bits of that data point. As discussed, this becomes too expensive with 
larger-dimensional datasets, so a stochastic approximation to log-PL is used. 


We referenced a function that will enable us to get PL cost during the get_cost_updates function, 
specifically the get_pseudo_likelihood_cost function. Now it's time to flesh out this function and 
obtain the pseudo-likelihood: 


    def get_pseudo_likelihood_cost(self, updates):
        # Index of the bit to flip; it advances by one on every call.
        bit_i_idx = theano.shared(value=0, name='bit_i_idx')

        # Binarize the input by rounding to the nearest integer.
        xi = T.round(self.input)

        fe_xi = self.free_energy(xi)

        # Flip bit bit_i_idx of every example in the minibatch.
        xi_flip = T.set_subtensor(xi[:, bit_i_idx],
                                  1 - xi[:, bit_i_idx])

        fe_xi_flip = self.free_energy(xi_flip)

        cost = T.mean(self.n_visible *
                      T.log(T.nnet.sigmoid(fe_xi_flip - fe_xi)))

        updates[bit_i_idx] = (bit_i_idx + 1) % self.n_visible

        return cost
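

To see why flipping a single bit is enough, note that the probability of bit i given all the other bits is sigmoid(F(x with bit i flipped) - F(x)), where F is the free energy. The following plain-Python sketch works through that calculation for one data point; it is an illustrative restatement under that assumption, not the Theano code, and its parameter values and helper names are made up:

```python
import math


def free_energy(v, W, vbias, hbias):
    """F(v) = -vbias.v - sum_j log(1 + exp(hbias_j + sum_i v_i W_ij))."""
    vbias_term = sum(b * vi for b, vi in zip(vbias, v))
    hidden_term = 0.0
    for j in range(len(hbias)):
        wx_b = hbias[j] + sum(v[i] * W[i][j] for i in range(len(v)))
        hidden_term += math.log(1.0 + math.exp(wx_b))
    return -hidden_term - vbias_term


def log_sigmoid(x):
    return -math.log(1.0 + math.exp(-x))


def pseudo_likelihood_estimate(x, bit_index, W, vbias, hbias):
    """Stochastic log-PL: n_visible * log p(x_i | all other bits) for one i."""
    flipped = list(x)
    flipped[bit_index] = 1 - flipped[bit_index]
    fe_x = free_energy(x, W, vbias, hbias)
    fe_flipped = free_energy(flipped, W, vbias, hbias)
    return len(x) * log_sigmoid(fe_flipped - fe_x)


W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]  # 3 visible x 2 hidden
vbias = [0.0, 0.1, -0.1]
hbias = [0.2, -0.3]
estimate = pseudo_likelihood_estimate([1, 0, 1], 1, W, vbias, hbias)
```

Scaling one bit's log-probability by the number of visible units stands in for summing over every bit; cycling the bit index from call to call, as the class method does, spreads the estimate across all bits over time.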


We've now filled out each element on the list of missing components and have completely reviewed 
the RBM class. We've explored how each element ties into the theory behind the RBM and should now 
have a thorough understanding of how the RBM algorithm works. We understand what the outputs of 
our RBM will be and will soon be able to review and assess them. In short, we're ready to train our 
RBM. Beginning the training of the RBM is a matter of running the following code, which compiles a 
training function over the train_set_x data. We'll discuss this function in greater depth later in this chapter: 


train_rbm = theano.function(
    [index],
    cost,
    updates=updates,
    givens={
        x: train_set_x[index * batch_size: (index + 1) * batch_size]
    },
    name='train_rbm'
)

plotting_time = 0.
start_time = time.clock()


Having updated the RBM's updates and training set, we run through training epochs. Within each 
epoch, we train over the training data before plotting the weights as a matrix (as described earlier in 
the chapter): 


for epoch in xrange(training_epochs):

    # Train over every minibatch in the training set.
    mean_cost = []
    for batch_index in xrange(n_train_batches):
        mean_cost += [train_rbm(batch_index)]

    print 'Training epoch %d, cost is ' % epoch, numpy.mean(mean_cost)

    # Plot the weights (filters) learned so far as a tiled image.
    plotting_start = time.clock()
    image = Image.fromarray(
        tile_raster_images(
            X=rbm.W.get_value(borrow=True).T,
            img_shape=(28, 28),
            tile_shape=(10, 10),
            tile_spacing=(1, 1)
        )
    )
    image.save('filters_at_epoch_%i.png' % epoch)
    plotting_stop = time.clock()
    plotting_time += (plotting_stop - plotting_start)

end_time = time.clock()
pretraining_time = (end_time - start_time) - plotting_time
print ('Training took %f minutes' % (pretraining_time / 60.))


The weights tend to plot fairly recognizably and resemble Gabor filters (linear filters commonly used 
for edge detection in images). If your dataset is handwritten characters on a fairly low-noise 
background, you tend to find that the weights trace the strokes used. For photographs, the filters will 
approximately trace edges in the image. The following image shows an example output: 


Finally, we create the persistent Gibbs chains that we need to derive our samples. The following 
code applies the Gibbs step discussed previously plot_every times, then updates the chain: 


plot_every = 1000

(
    [
        presig_hids,
        hid_mfs,
        hid_samples,
        presig_vis,
        vis_mfs,
        vis_samples
    ],
    updates
) = theano.scan(
    rbm.gibbs_vhv,
    outputs_info=[None, None, None, None, None, persistent_vis_chain],
    n_steps=plot_every
)

This code runs the gibbs_vhv function we described previously, plotting network output samples for 
our inspection: 


updates.update({persistent_vis_chain: vis_samples[-1]})
sample_fn = theano.function(
    [],
    [
        vis_mfs[-1],
        vis_samples[-1]
    ],
    updates=updates,
    name='sample_fn'
)

image_data = numpy.zeros(
    (29 * n_samples + 1, 29 * n_chains - 1),
    dtype='uint8'
)

for idx in xrange(n_samples):
    vis_mf, vis_sample = sample_fn()
    print ' ... plotting sample ', idx
    image_data[29 * idx:29 * idx + 28, :] = tile_raster_images(
        X=vis_mf,
        img_shape=(28, 28),
        tile_shape=(1, n_chains),
        tile_spacing=(1, 1)
    )

image = Image.fromarray(image_data)
image.save('samples.png')
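To make the sampling loop more concrete, here is a minimal sketch of one visible-hidden-visible Gibbs step for a tiny binary RBM, written in plain Python rather than Theano. The weights, biases, and the function name gibbs_vhv_step are illustrative assumptions and are not the chapter's RBM class; the intent is only to show that each step samples the hidden units given the visible units, then resamples the visible units given those hidden samples, with the chain state carried from one step to the next.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def sample_bernoulli(probs, rng):
    """Draw one 0/1 sample per probability."""
    return [1 if rng.random() < p else 0 for p in probs]


def gibbs_vhv_step(v, W, hbias, vbias, rng):
    """One Gibbs step: visible -> hidden sample -> visible sample.

    W[i][j] connects visible unit i to hidden unit j.
    Returns (visible mean activations, visible sample).
    """
    n_visible, n_hidden = len(W), len(W[0])
    h_mean = [sigmoid(sum(v[i] * W[i][j] for i in range(n_visible)) + hbias[j])
              for j in range(n_hidden)]
    h_sample = sample_bernoulli(h_mean, rng)
    v_mean = [sigmoid(sum(h_sample[j] * W[i][j] for j in range(n_hidden)) + vbias[i])
              for i in range(n_visible)]
    v_sample = sample_bernoulli(v_mean, rng)
    return v_mean, v_sample


rng = random.Random(0)
# Hypothetical toy parameters: 3 visible units, 2 hidden units.
W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
hbias = [0.0, 0.0]
vbias = [0.0, 0.0, 0.0]

chain = [1, 0, 1]  # plays the role of the persistent chain state
for _ in range(1000):  # analogous to running plot_every Gibbs steps
    v_mean, chain = gibbs_vhv_step(chain, W, hbias, vbias, rng)

print(v_mean, chain)
```

The mean activations (v_mean) are what one would typically plot, since they are smoother than the binary samples, while the binary sample is what gets fed back into the chain.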
At this point, we have an entire RBM. We have the PCD algorithm and the ability to update the 
network using this algorithm and Gibbs sampling. We have several visible output methods so that we 
can assess how well our RBM has trained. 


However, we're not done yet! Next, we'll begin to see what the most frequent and powerful 
application of the RBM is. 
Further applications of the RBM 


We can use the RBM as an ML algorithm in and of itself. It functions comparably well with other 
algorithms. Advantageously, it can be scaled up to a point where it can learn high-dimensional 
datasets. However, this isn't where the real strength of the RBM lies. 


The RBM is most commonly used as a pretraining mechanism for a highly effective deep network 
architecture called a DBN. DBNs are extremely powerful tools to learn and classify a range of image 
datasets. They possess a very good ability to generalize to unknown cases and are among the best 
image-learning tools available. For this reason, DBNs are in use at many of the world's top tech and 
data science companies, primarily in image search and recognition contexts. 




Deep belief networks 


A DBN is a graphical model, constructed using multiple stacked RBMs. While the first RBM trains a 
layer of features based on input from the pixels of the training data, subsequent layers treat the 
activations of preceding layers as if they were pixels and attempt to learn the features in subsequent 
hidden layers. This is frequently described as learning the representation of data and is a common 
theme in deep learning. 


How many RBMs there should be depends on what is needed for the problem at hand. From 
a practical perspective, it's a trade-off between increasing accuracy and increasing computational 
cost. Each additional layer of RBMs will improve the lower bound of the log probability of the 
training data. In other words, the DBN almost inevitably becomes less bad with each additional layer 
of features. 


As far as layer size is concerned, it is generally advantageous to reduce the number of nodes in the 
hidden layers of successive RBMs. One should avoid contexts in which an RBM has at least as many 
hidden units as it has visible units, that is, as the RBM preceding it has hidden units (which raises 
the risk of simply learning the identity function of the network). 


It can be advantageous (but is by no means necessary) when successive RBMs decrease in layer size 
until the final RBM has a layer size approximating the dimensionality of variance in the data. Affixing 
an MLP to the end of a DBN whose layers have too many nodes will harm classification performance; 
it's like trying to affix a drinking straw to the end of a hosepipe! Even an MLP with many neurons may 
not successfully train in such contexts. On a related note, it has been observed that even if the layers 
don't contain very many nodes, with enough layers, more or less any function can be modeled. 


Determining the dimensionality of variance in the data is not a simple task. One tool that can 
support this task is PCA; as we saw in the preceding chapter, PCA can enable us to get a reasonable 
idea as to how many components of meaningful size exist in the input data. 
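One common way to turn PCA output into a layer-size estimate is to count how many components are needed to account for a chosen share of the total variance. The sketch below assumes you already have per-component variances (for example, the eigenvalues of the covariance matrix). The variance values, the 95% threshold, and the helper name components_for_variance are all hypothetical and chosen only for illustration.

```python
def components_for_variance(variances, threshold=0.95):
    """Smallest number of leading components whose cumulative share of
    variance reaches `threshold`. `variances` need not be pre-sorted."""
    ordered = sorted(variances, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, var in enumerate(ordered, start=1):
        running += var
        if running / total >= threshold:
            return count
    return len(ordered)


# Hypothetical per-component variances, for illustration only.
example_variances = [5.0, 2.5, 1.5, 0.6, 0.3, 0.1]
k = components_for_variance(example_variances, threshold=0.95)
print(k)
```

The resulting count is only a starting point for the size of the final RBM layer; the threshold itself is a judgment call.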
Training a DBN 


Training a DBN is typically done greedily, which is to say that it trains to optimize locally at each 
layer, rather than attempting to reach a global optimum. The learning process is as follows: 


• The first layer of the DBN is trained using the method that we saw in our earlier discussion of 
RBM learning. As such, the first layer converts its data distribution to a posterior distribution 
using Gibbs sampling over the hidden units. 

• This distribution is far more conducive to RBM training than the input data itself, so the next 
RBM layer learns that distribution! 

• Successive RBM layers continue to train on the samples output by preceding layers. 

• All of the parameters within this architecture are tuned using a performance measure. 


This performance measure may vary. It may be a log-likelihood proxy used in gradient descent, as 
discussed earlier in the chapter. In supervised contexts, a classifier (for example, an MLP) can be 
added as the final layer of the architecture and prediction accuracy can be used as the performance 
measure to fine-tune the deep architecture. 
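The control flow of greedy layer-wise pretraining can be sketched in a few lines of plain Python. Note the assumptions: train_layer is a placeholder for real RBM training, and the toy trainer used here is deliberately not an RBM (it simply keeps the first few features). The sketch is only meant to show that each layer is fitted on its own, and that its output then becomes the training data for the next layer.

```python
def greedy_pretrain(data, layer_sizes, train_layer):
    """Train one layer at a time, feeding each layer the previous layer's output.

    `train_layer(inputs, n_hidden)` is a placeholder for RBM training; it must
    return a function that maps one input vector to its hidden representation.
    """
    layers = []
    layer_input = data
    for n_hidden in layer_sizes:
        layer = train_layer(layer_input, n_hidden)      # fit this layer only
        layers.append(layer)
        layer_input = [layer(x) for x in layer_input]   # becomes next layer's data
    return layers


seen_input_sizes = []


def toy_train_layer(inputs, n_hidden):
    """Stand-in trainer (NOT an RBM): keeps the first n_hidden features."""
    seen_input_sizes.append(len(inputs[0]))
    return lambda x: x[:n_hidden]


data = [[0.1 * i for i in range(8)] for _ in range(5)]
layers = greedy_pretrain(data, [6, 4, 2], toy_train_layer)
print(seen_input_sizes)
```

The recorded input sizes show each layer training on the output width of the layer before it; the final tuning step against a performance measure is not covered by this sketch.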


Let's move on to using the DBN in practice. 
Applying the DBN 


Having discussed the DBN and theory surrounding it, it's time to set up our own. We'll be working in 
a similar way to the RBM, by walking through a DBN class and connecting the code to the theory, 
discussing what to expect and how to review the network's performance, before initializing and 
training our network to see it in action. 


Let's take a look at our DBN class: 
class DBN(object):

    def __init__(self, numpy_rng, theano_rng=None, n_ins=784,
                 hidden_layers_sizes=[500, 500], n_outs=10):

        self.sigmoid_layers = []
        self.rbm_layers = []
        self.params = []
        self.n_layers = len(hidden_layers_sizes)

        assert self.n_layers > 0

        if not theano_rng:
            theano_rng = RandomStreams(numpy_rng.randint(2 ** 30))

        self.x = T.matrix('x')
        self.y = T.ivector('y')


The DBN class contains a number of parameters that bear further explanation. The numpy_rng and 
theano_rng parameters, used to determine initial weights, are already familiar from our examination 
of the RBM class. The n_ins parameter is a pointer to the dimension (in features) of the DBN's input. 
The hidden_layers_sizes parameter is a list of hidden layer sizes. Each value in this list will 
guide the DBN constructor in creating an RBM layer of the relevant size; as you'll note, the n_layers 
parameter refers to the number of layers in the network and is set by hidden_layers_sizes. 
Adjustment of values in this list enables us to make DBNs whose layer sizes taper down from the 
input layer size, to increasingly succinct representations, as discussed earlier in the chapter. 


It's also worth noting that self.sigmoid_layers will store the MLP component (the final layer of 
the DBN), while self.rbm_layers stores the RBM layers used to pretrain the MLP. 


With this done, we do the following to complete our DBN architecture: 


• We create n_layers sigmoid layers 

• We connect the sigmoid layers to form an MLP 

• We construct an RBM for each sigmoid layer with a shared weight matrix and hidden bias 
between each sigmoid layer and RBM 


The following code creates n_layers many layers with sigmoid activations; first creating the input 
layer, then creating hidden layers whose size corresponds to the values in our 
hidden_layers_sizes list: 


for i in xrange(self.n_layers):
    if i == 0:
        input_size = n_ins
    else:
        input_size = hidden_layers_sizes[i - 1]

    if i == 0:
        layer_input = self.x
    else:
        layer_input = self.sigmoid_layers[-1].output

    sigmoid_layer = HiddenLayer(rng=numpy_rng,
                                input=layer_input,
                                n_in=input_size,
                                n_out=hidden_layers_sizes[i],
                                activation=T.nnet.sigmoid)
    self.sigmoid_layers.append(sigmoid_layer)
    self.params.extend(sigmoid_layer.params)
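The size bookkeeping in this loop can be checked in isolation. The following plain-Python sketch computes the (inputs, outputs) pair each layer would receive, given n_ins and a hidden_layers_sizes list. The helper name layer_shapes and the tapering sizes in the second call are hypothetical, introduced only for this illustration.

```python
def layer_shapes(n_ins, hidden_layers_sizes):
    """Return the (n_in, n_out) pair for each hidden layer, in order."""
    shapes = []
    for i, n_out in enumerate(hidden_layers_sizes):
        # The first layer reads the raw input; later layers read the
        # output of the layer before them.
        n_in = n_ins if i == 0 else hidden_layers_sizes[i - 1]
        shapes.append((n_in, n_out))
    return shapes


default_shapes = layer_shapes(784, [500, 500])
tapered_shapes = layer_shapes(784, [400, 200, 100])  # illustrative sizes

print(default_shapes)
print(tapered_shapes)
```

Each layer's input size is the previous layer's output size, which is why a tapering list of sizes yields progressively more compact representations.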


Next up, we create an RBM that shares weights with the sigmoid layers. This directly leverages the 
RBM class that we described previously: 


    rbm_layer = RBM(numpy_rng=numpy_rng,
                    theano_rng=theano_rng,
                    input=layer_input,
                    n_visible=input_size,
                    n_hidden=hidden_layers_sizes[i],
                    W=sigmoid_layer.W,
                    hbias=sigmoid_layer.b)
    self.rbm_layers.append(rbm_layer)
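Sharing works here because the RBM and the sigmoid layer are handed the very same parameter objects, not copies. The toy classes below are stand-ins invented for this sketch (SharedValue loosely imitates a Theano shared variable; it is not Theano's API). They show that an update made through the RBM is immediately visible through the sigmoid layer.

```python
class SharedValue(object):
    """Minimal stand-in for a shared variable: a mutable container."""

    def __init__(self, values):
        self.values = list(values)


class ToySigmoidLayer(object):
    def __init__(self, W, b):
        self.W = W
        self.b = b


class ToyRBM(object):
    def __init__(self, W, hbias):
        self.W = W          # same object as the sigmoid layer's W
        self.hbias = hbias  # same object as the sigmoid layer's b


W = SharedValue([0.1, -0.2, 0.3])
b = SharedValue([0.0])
sigmoid_layer = ToySigmoidLayer(W, b)
rbm_layer = ToyRBM(W=sigmoid_layer.W, hbias=sigmoid_layer.b)

# A "pretraining update" applied through the RBM...
rbm_layer.W.values[0] += 0.5

# ...is visible through the sigmoid layer, because both hold one object.
print(sigmoid_layer.W.values)
```

This is the property that lets pretraining the RBMs double as initialization of the MLP: no weights need to be copied across afterwards.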


Finally, we add a logistic regression layer to the end of the DBN so as to form an MLP: 


self.logLayer = LogisticRegression(
    input=self.sigmoid_layers[-1].output,
    n_in=hidden_layers_sizes[-1],
    n_out=n_outs)
self.params.extend(self.logLayer.params)

self.finetune_cost = self.logLayer.negative_log_likelihood(self.y)

self.errors = self.logLayer.errors(self.y)
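For readers who want to see what the two quantities at the end of this listing measure, here is a plain-Python sketch of a mean negative log-likelihood over softmax outputs and a misclassification rate. It assumes the logistic regression layer averages both quantities over the minibatch; the function names and the example scores are made up for illustration and are not taken from the LogisticRegression class itself.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(batch_scores, labels):
    """Mean of -log p(correct class) over the batch."""
    losses = [-math.log(softmax(scores)[label])
              for scores, label in zip(batch_scores, labels)]
    return sum(losses) / len(losses)


def error_rate(batch_scores, labels):
    """Fraction of examples whose highest-scoring class is not the label."""
    wrong = sum(1 for scores, label in zip(batch_scores, labels)
                if scores.index(max(scores)) != label)
    return wrong / float(len(labels))


# Hypothetical scores for three examples over three classes.
batch_scores = [[2.0, 0.5, 0.1], [0.2, 0.1, 1.5], [0.3, 0.9, 0.2]]
labels = [0, 2, 0]

nll = negative_log_likelihood(batch_scores, labels)
err = error_rate(batch_scores, labels)
print(nll, err)
```

The negative log-likelihood is differentiable and therefore suitable as a fine-tuning cost, whereas the error rate is the quantity one actually reports.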


Now that we've put together our MLP class, let's construct the DBN. The following code constructs the 
network with 28 * 28 inputs (that is, 28 * 28 pixels in the MNIST image data), three hidden layers of 
decreasing size, and 10 output values (for each of the 10 handwritten number classes in the MNIST 
dataset): 
numpy_rng = numpy.random.RandomState(123)

print '... building the model'

dbn = DBN(numpy_rng=numpy_rng, n_ins=28 * 28,
          hidden_layers_sizes=[1000, 800, 720],
          n_outs=10)


As discussed earlier in this section, a DBN trains in two stages: a layer-wise pretraining in which 
each layer takes the output of the preceding layer to train on, which is followed by a fine-tuning step 
(backpropagation) that allows for weight adjustment across the whole network. The first stage, 
pretraining, is achieved by performing one step of PCD within each layer's RBM. The following code 
will perform this pretraining step: 


print '... getting the pretraining functions'
pretraining_fns = dbn.pretraining_functions(train_set_x=train_set_x,
                                            batch_size=batch_size,
                                            k=k)

print '... pre-training the model'
start_time = time.clock()

for i in xrange(dbn.n_layers):
    for epoch in xrange(pretraining_epochs):
        c = []
        for batch_index in xrange(n_train_batches):
            c.append(pretraining_fns[i](index=batch_index,
                                        lr=pretrain_lr))
        print 'Pre-training layer %i, epoch %d, cost ' % (i, epoch),
        print numpy.mean(c)

end_time = time.clock()


Running the pretrained DBN is then achieved by the following command: 


python code/DBN.py 
Note 


Note that even with GPU acceleration, this code will spend quite a lot of time pretraining, and it is 
therefore suggested that you run it overnight. 
Validating the DBN 


Validation of a DBN as a whole is done in a very familiar way. We can use the minimal validation 
error from cross-validation as one error measure. However, the minimal cross-validation error can 
underestimate the error expected on new data, as the metaparameters may overfit to the 
cross-validation data. 


As such, we should use our cross-validation error to adjust our metaparameters until the 
cross-validation error is minimized. Then we should expose our DBN to the held-out test set, using test 
error as our validation measure. Our DBN class performs exactly this training process. 
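The selection logic described here can be sketched independently of any network. The function below assumes the common pattern of reading the test error only at epochs where the validation error improves, so that the test set never influences which epoch is chosen. The error curves are hypothetical numbers invented to exercise the function, and the name track_best_epoch is an assumption of this sketch.

```python
def track_best_epoch(validation_errors, test_errors):
    """Return (epoch, validation error, test error) at the lowest validation error.

    The test error is only read at epochs where validation improves, so the
    test set never influences which epoch is selected.
    """
    best_epoch, best_validation, test_at_best = None, float("inf"), None
    for epoch, validation in enumerate(validation_errors):
        if validation < best_validation:
            best_epoch = epoch
            best_validation = validation
            test_at_best = test_errors[epoch]
    return best_epoch, best_validation, test_at_best


# Hypothetical error curves, invented purely to exercise the function.
validation_errors = [0.050, 0.030, 0.040, 0.020, 0.025]
test_errors = [0.060, 0.035, 0.050, 0.030, 0.028]

result = track_best_epoch(validation_errors, test_errors)
print(result)
```

Note that the reported test error is the one at the best validation epoch, even though a later epoch in this made-up example happens to have a lower test error; picking that later epoch would amount to tuning on the test set.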


However, this doesn't tell us exactly what to do if the network fails to train adequately. What do we 
do if our DBN is underperforming? 


The first thing to do is recognize the potential causes and, in this area, there are some usual culprits. 
We know that the training of underlying RBMs is also quite tricky and any individual layer may fail to 
train. Thankfully, our RBM class gives us the ability to tap into and view the weights (filters) being 
generated by each layer, and we can plot these to get a view on what our network is attempting to 
represent. 


Additionally, we want to ask whether our network is overfitting or underfitting. Either is 
entirely possible and it's useful to recognize how and why this might be happening. In the case of 
underfitting, the training process may simply be unable to find good parameters for the model. This is 
particularly common when you are using a larger network to resolve a large problem space, but can 
be seen even with some smaller models. If you think that underfitting might be happening with your 
DBN, you have a couple of options. The first is to simply reduce the size of your hidden layers. This 
may, or may not, work well. A better alternative is to gradually taper your hidden layers such that 
each layer learns a refined version of the preceding layer's representation. How to do this, how 
sharply to taper, and when to stop is a matter of trial and error in the first case and of 
experience-based learning over the long term. 


Overfitting is a well-known phenomenon where your algorithm trains overly specifically on the 
training data provided. This class of problem is typically identified at the point of cross-validation 
(where your error rate will increase dramatically), but can be quite pernicious. Means of resolving an 
overfitting issue do exist; one can increase the training dataset size. A more heavy-handed Bayesian 
approach would be to attach an additional criterion (for example, a prior) that is used to reduce the 
value of fitting the training data. Some of the most effective methods to improve classification 
performance are preprocessing methods, which we'll discuss in Chapter 6, Text Feature 
Engineering and Chapter 7, Feature Engineering Part II. 


Though this code will initialize from a predefined position (given a seed value), the stochastic nature 
of the model means that it will quickly diverge and results may vary. When running on my system, this 
DBN achieved a minimal cross-validation error of 1.19%. More importantly, it achieved a test error 
of 1.30% after 46 supervised epochs. These are good results: indeed, they are comparable with 
field-leading examples! 
Further reading 


For a primer on neural networks, it makes sense to read from a range of sources. There are many 
concerns to be aware of and different authors emphasize different material. A solid introduction is 
provided by Kevin Gurney in An Introduction to Neural Networks. 


An excellent piece on the intuitions underlying Markov Chain Monte Carlo is available at 
http://twiecki.github.io/blog/2015/11/10/mcmc-sampling/. 





For readers with a specific interest in the intuitions supporting Gibbs sampling, Philip Resnik and 
Eric Hardisty's paper, Gibbs Sampling for the Uninitiated, provides a technical, but clear 
description of how Gibbs works. It's particularly notable for having some really first-rate analogies! 
Find them at https://www.umiacs.umd.edu/~resnik/pubs/LAMP-TR-153.pdf. 





There aren't many good explanations of Contrastive Divergence; one I like is provided by Oliver 
Woodford at http://www.robots.ox.ac.uk/~ojw/files/NotesOnCD.pdf. If you're a little daunted by the 
heavy use of formal expressions, I would still recommend that you read it for its articulate description 
of the theory and practical concerns involved. 


This chapter used the Theano documentation available at http://deeplearning.net/tutorial/contents.html 
as a base for discussion and implementation of RBM and DBN classes. 
Summary 


We've covered a lot of ground in this chapter! We began with an overview of neural networks, 
focusing on the general properties of topology and learning method before taking a deep dive into the 
RBM algorithm and RBM code itself. We took this solid understanding forward to create a DBN. In 
doing so, we linked the DBN theory and code together, before firing up our DBN to work over the 
MNIST dataset. We performed image classification in a 10-class problem and achieved an extremely 
competitive result, with classification error below 2%! 


In the next chapter, we'll continue to build on your mastery of deep learning by introducing you to 
another deep learning architecture: Stacked Denoising Autoencoders (SdA). 




Chapter 3. Stacked Denoising Autoencoders 


In this chapter, we'll continue building our skill with deep architectures by applying Stacked 
Denoising Autoencoders (SdA) to learn feature representations for high-dimensional input data. 


We'll start, as before, by gaining a solid understanding of the theory and concepts that underpin 
autoencoders. We'll identify related techniques and call out the strengths of autoencoders as part of 
your data science toolkit. We'll discuss the use of Denoising Autoencoders (dA), a variation of the 
algorithm that introduces stochastic corruption to the input data, obliging the autoencoder to decorrupt 
the input and, in so doing, build a more effective feature representation. 


We'll follow up on theory, as before, by walking through the code for a dA class, linking theory and 
implementation details to build a strong understanding of the technique. 


At this point, we'll take a journey very similar to that taken in the preceding chapter. By stacking dA, 
we'll create a deep architecture that can be used to pretrain an MLP network, which offers substantial 
performance improvements in a range of unsupervised learning applications including speech data 
processing. 
Autoencoders 


The autoencoder (also called the Diabolo network) is another crucial component of deep 
architectures. The autoencoder is related to the RBM, with autoencoder training resembling RBM 
training; however, autoencoders can be easier to train than RBMs with contrastive divergence and are 
thus preferred in contexts where RBMs train less effectively. 
Introducing the autoencoder 


An autoencoder is a simple three-layer neural network whose output units are directly connected back 
to the input units. The objective of the autoencoder is to encode the i-dimensional input into an 
h-dimensional representation, where h < i, before reconstructing (decoding) the input at the output 
layer. The training process involves iteration over this process until the reconstruction error is 
minimized, at which point one should have arrived at the most efficient representation of input data 
(should, barring the possibility of arriving at local minima!). 


In a preceding chapter, we discussed PCA as being a powerful dimensionality reduction technique. 
This description of autoencoders as finding the most efficient reduced-dimensional representation of 
input data will no doubt be familiar and you may be asking why we're exploring another technique 
that fulfils the same role. 


The simple answer is that, like the SOM, autoencoders can provide nonlinear reductions, which 
enables them to process high-dimensional input data more effectively than PCA. This revives a form 
of our earlier question: why discuss autoencoders if they deliver what an SOM does, without even 
providing the illuminating visual presentation? 


Simply put, autoencoders are a more developed and sophisticated set of techniques; the use of 
denoising and stacking techniques enables reductions of high-dimensional, multimodal data that can be 
trained with relative ease to greater accuracy, at greater scale, than the techniques that we discussed 
in Chapter 1, Unsupervised Machine Learning. 


Having discussed the capabilities of autoencoders at a high level, let's dig in a little further to 
understand the topology of autoencoders as well as what their training involves. 


Topology 


As described earlier in this chapter, an autoencoder has a relatively simple structure. It is a 
three-layer neural network, with input, hidden, and output layers. The input feeds forward into the hidden 
layer, then the output layer, as with most neural network architectures. One topological feature worth 
mentioning is that the hidden layer typically has fewer nodes than the input or output layers. 
(However, as intimated previously, the required number of hidden nodes is really a function of the 
complexity of the input data; the goal of the hidden layer is to bottleneck the information content from 
the input and force the network to identify a representation that captures underlying statistical 
properties. Representing very complex input accurately might require a large quantity of hidden 
nodes.) 


The key feature of an autoencoder is that the output is typically set to be the input; the performance 
measure for an autoencoder is its accuracy in reconstructing the input after encoding it within the 
hidden layer. Autoencoder topology tends to take the following form: 
[Figure: autoencoder topology, with an encode step from the input layer to the hidden layer and a decode step from the hidden layer to the output layer]





The encoding function that occurs between the input and hidden layers is a mapping of an input (x) to 
a new form (y). A simple example mapping function might be a nonlinear (in this case sigmoid, s) 
function of the input as follows: 


y=s(Wx+b) 


However, more sophisticated encodings may exist or be developed to accommodate specific subject 
domains. In this case, of course, W represents the weight values assigned to x and b is an adjustable 
variable that can be tuned to enable the minimization of reconstruction error. 


The autoencoder then decodes to deliver its output. This reconstruction is intended to take the same 
shape as x and will occur through a similar transformation as follows: 


z=s(W'y+b') 


Here, b' and W' are typically also configurable to allow network optimization. 
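The encode and decode mappings can be tried out on a toy example. The following plain-Python sketch applies a sigmoid layer twice: once to map a four-dimensional input to two hidden values, and once to map those hidden values back to four outputs. All sizes and parameter values are hypothetical, and W_prime is set to the transpose of W purely for convenience in this example.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def layer(weights, bias, vector):
    """Compute s(weights . vector + bias), one output per row of weights."""
    return [sigmoid(sum(w * v for w, v in zip(row, vector)) + b)
            for row, b in zip(weights, bias)]


# Hypothetical toy sizes: i = 4 inputs, h = 2 hidden units.
x = [1.0, 0.0, 1.0, 0.0]
W = [[0.5, -0.3, 0.8, 0.1],      # h rows, i columns
     [-0.6, 0.2, 0.4, 0.7]]
b = [0.0, 0.0]
W_prime = [[0.5, -0.6],          # i rows, h columns
           [-0.3, 0.2],
           [0.8, 0.4],
           [0.1, 0.7]]
b_prime = [0.0, 0.0, 0.0, 0.0]

y = layer(W, b, x)               # encode: y = s(Wx + b)
z = layer(W_prime, b_prime, y)   # decode: z = s(W'y + b')

print(y)
print(z)
```

With untrained parameters like these, z will not resemble x; training adjusts W, b, W', and b' so that it does.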


Training 


The network trains, as discussed, by minimizing the reconstruction error. One popular method to 
measure this error is a simple squared error measure, as shown in the following formula: 


$$E = \frac{1}{2} \lVert z - x \rVert^{2}$$
However, different and more appropriate error measures exist for cases where the input is in a less 
generic format (such as a set of bit probabilities). 
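As a small worked example, the sketch below computes the squared reconstruction error for a toy input and reconstruction, alongside a cross-entropy measure. Cross-entropy is offered here as one commonly used option for inputs that can be read as bit probabilities; the text above does not name a specific alternative, so treat that choice, the function names, and the example vectors as assumptions of this illustration.

```python
import math


def squared_error(x, z):
    """Half the squared Euclidean distance between input x and reconstruction z."""
    return 0.5 * sum((zi - xi) ** 2 for xi, zi in zip(x, z))


def cross_entropy(x, z):
    """Reconstruction cross-entropy; assumes every z value lies strictly in (0, 1)."""
    return -sum(xi * math.log(zi) + (1.0 - xi) * math.log(1.0 - zi)
                for xi, zi in zip(x, z))


# Hypothetical input bits and two candidate reconstructions.
x = [1.0, 0.0, 1.0, 1.0]
z_rough = [0.9, 0.2, 0.8, 0.6]
z_close = [0.99, 0.01, 0.98, 0.97]

print(squared_error(x, z_rough), squared_error(x, z_close))
print(cross_entropy(x, z_rough), cross_entropy(x, z_close))
```

Both measures shrink as the reconstruction approaches the input; they differ in how heavily they penalize confident mistakes.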


While the intention is that autoencoders capture the main axes of variation in the input dataset, it is 
possible for an autoencoder to learn something far less useful: the identity function of the input. 
Denoising autoencoders 


While autoencoders can work well in some applications, they can be challenging to apply to 
problems where the input data contains a complex distribution that must be modeled in high 
dimensionality. The major challenge is that, with autoencoders that have n-dimensional input and an 
encoding of at least n dimensions, there is a real likelihood that the autoencoder will just learn the 
identity function of the input. In such cases, the encoding is a literal copy of the input. Such 
autoencoders are called overcomplete. 


Note 


One of the most important considerations when training a machine learning technique is to understand how 
the dimensionality of hidden layers affects the quality of the resulting model. In cases where the input 
data is complex and the hidden layer has too few nodes to capture that complexity effectively, the 
result is obvious: the network fails to train as well as it might with more nodes. 


To capture complex distributions in input data, then, you may wish to use a large number of hidden 
nodes. In cases where the hidden layer has at least as many nodes as the input, there is a strong 
possibility that the network will learn the identity of the input; in such cases, each element of the input 
is learned as a specific unique case. Naturally, a model that has been trained to do this will work very 
well over training data, but as it has learned a trivial pattern that cannot be generalized to unfamiliar 
data, it is liable to fail catastrophically when validated. 


This is particularly relevant when modeling complex data, such as speech data. Such data is 
frequently complex in distribution, so the classification of speech signals requires multimodal 
encoding and a high-dimensional hidden layer. Of course, this brings an increased risk of the 
autoencoder (or any of a large number of models, as this is not an autoencoder-specific problem) 
learning the identity function. 


While (rather surprisingly) overcomplete autoencoders can and do learn error-minimizing 
representations under certain configurations (namely, ones in which the first hidden layer needs very 
small weights so as to force the hidden units into a linear orientation and subsequent weights have 
large values), such configurations are difficult to optimize for, and it has been desirable to find 
another way to prevent overcomplete autoencoders from learning the identity function. 


There are several different ways that an overcomplete autoencoder can be prevented from learning 
the identity function while still capturing something useful within its representation. By far, the most 
popular approach is to introduce noise to the input data and force the autoencoder to train on the noisy 
data by learning distributions and statistical regularities rather than identity. This can be effectively 
achieved by multiple methods, including using sparseness constraints or dropout techniques (wherein 
input values are randomly set to zero). 


The process that we'll be using to introduce noise to the input in this chapter is dropout. Via this 
method, up to half of the inputs are randomly set to zero. To achieve this, we create a stochastic 
corruption process that operates on our input data: 
def get_corrupted_input(self, input, corruption_level):
    return self.theano_rng.binomial(size=input.shape, n=1,
                                    p=1 - corruption_level,
                                    dtype=theano.config.floatX) * input


In order to accurately model the input data, the autoencoder has to predict the corrupted values from 
the uncorrupted values, thus learning meaningful statistical properties (that is, distribution). 
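The same masking idea can be reproduced without Theano. In the plain-Python sketch below, each input value is kept with probability 1 - corruption_level and set to zero otherwise, which mirrors multiplying the input by a vector of independent 0/1 draws. The function name corrupt, the seed, and the example input are assumptions made for this illustration.

```python
import random


def corrupt(values, corruption_level, rng):
    """Zero each input independently with probability corruption_level."""
    return [v if rng.random() >= corruption_level else 0.0 for v in values]


rng = random.Random(42)
x = [0.9, 0.1, 0.8, 0.4, 0.7, 0.3]
corrupted = corrupt(x, corruption_level=0.5, rng=rng)
print(corrupted)
```

Because each element is masked independently, the proportion of zeroed inputs matches corruption_level only on average; any single call may zero more or fewer values.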


In addition to preventing an autoencoder from learning the identity values of data, adding a denoising 
process also tends to produce models that are substantially more robust to input variations or 
distortion. This proves to be particularly useful for input data that is inherently noisy, such as speech 
or image data. One commonly recognized advantage of deep learning techniques, mentioned in the 
preface to this book, is that deep learning algorithms minimize the need for feature engineering. 
Where many learning algorithms require lengthy and complicated preprocessing of input data 
(filtering of images or manipulation of audio signals) to reconstruct the denoised input and enable the 
model to train, a dA can work effectively with minimal preprocessing. This can dramatically 
decrease the time it takes to train a model over your input data to practical levels of accuracy. 


Finally, it's worth observing that an autoencoder that learns the identity function of the input dataset is 
probably misconfigured in a fundamental way. As the main added value of the autoencoder is to find a 
lower-dimensional representation of the feature set, an autoencoder that has learned the identity 
function of the input data may simply have too many nodes. If in doubt, consider reducing the number 
of nodes in your hidden layer. 


Now that we've discussed the topology of an autoencoder, the means by which one might be 
effectively trained, and the role of denoising in improving autoencoder performance, let's review 
Theano code for a dA so as to carry the preceding theory into practice. 
Applying a dA 


At this point, we're ready to step through the implementation of a dA. Once again, we're leveraging 
the Theano library to apply a dA class. 


Unlike the RBM class that we explored in the previous chapter, the DenoisingAutoencoder is relatively 
simple, and tying the functionality of the dA to the theory and math that we examined earlier in this 
chapter is straightforward. 


In Chapter 2, Deep Belief Networks, we applied an RBM class that had a number of elements that, 
while not necessary for the correct functioning of the RBM in itself, enabled shared parameters within 
multilayer, deep architectures. The dA class we'll be using possesses similar shared elements that 
will provide us with the means to build a multilayer autoencoder architecture later in the chapter. 


We begin by initializing a dA class. We specify the number of visible units, n_visible, as well as the 
number of hidden units, n_hidden. We additionally specify variables for the configuration of the input 
(input) as well as the weights (W) and the hidden and visible bias values (bhid and bvis 
respectively). The four additional variables enable autoencoders to receive configuration parameters 
from other elements of a deep architecture: 


class dA(object):

    def __init__(
        self,
        numpy_rng,
        theano_rng=None,
        input=None,
        n_visible=784,
        n_hidden=500,
        W=None,
        bhid=None,
        bvis=None
    ):
        self.n_visible = n_visible
        self.n_hidden = n_hidden

We follow up by initializing the weight and bias variables. We set the weight matrix, W, to an initial 
value, initial_W, which we obtain using random, uniform sampling from the range: 





$$-4\sqrt{\frac{6}{n\_hidden + n\_visible}} \quad \text{to} \quad 4\sqrt{\frac{6}{n\_hidden + n\_visible}}$$
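To get a feel for the size of this range, the bound can be computed directly. The plain-Python sketch below evaluates it for 784 visible and 500 hidden units and draws a small weight matrix from the corresponding uniform distribution. The helper names, the seed, and the small 6 x 4 example matrix are assumptions made for illustration only.

```python
import math
import random


def init_bound(n_visible, n_hidden):
    """Half-width of the uniform range: 4 * sqrt(6 / (n_hidden + n_visible))."""
    return 4.0 * math.sqrt(6.0 / (n_hidden + n_visible))


def init_weights(n_visible, n_hidden, rng):
    """Draw an n_visible x n_hidden matrix uniformly from [-bound, bound]."""
    bound = init_bound(n_visible, n_hidden)
    return [[rng.uniform(-bound, bound) for _ in range(n_hidden)]
            for _ in range(n_visible)]


rng = random.Random(123)
bound = init_bound(784, 500)
W_small = init_weights(6, 4, rng)  # deliberately tiny, for readability

print(round(bound, 4))
```

The bound shrinks as the layers grow, which keeps the summed input to each sigmoid unit in a range where the unit is not saturated at the start of training.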


We then set the visible and hidden bias variables to arrays of zeroes using numpy.zeros: 
if not theano_rng:
    theano_rng = RandomStreams(numpy_rng.randint(2 ** 30))

if not W:
    initial_W = numpy.asarray(
        numpy_rng.uniform(
            low=-4 * numpy.sqrt(6. / (n_hidden + n_visible)),
            high=4 * numpy.sqrt(6. / (n_hidden + n_visible)),
            size=(n_visible, n_hidden)
        ),
        dtype=theano.config.floatX
    )
    W = theano.shared(value=initial_W, name='W', borrow=True)

if not bvis:
    bvis = theano.shared(
        value=numpy.zeros(
            n_visible,
            dtype=theano.config.floatX
        ),
        borrow=True
    )

if not bhid:
    bhid = theano.shared(
        value=numpy.zeros(
            n_hidden,
            dtype=theano.config.floatX
        ),
        name='b',
        borrow=True
    )


Earlier in the chapter, we described how the autoencoder translates between visible and hidden 
layers via mappings such as y = s(Wx + b). To enable such translation, it is necessary to define W, b, 
W', and b' in relation to the previously described autoencoder parameters, bhid, bvis, and W. W' and 
b' are referred to as W_prime and b_prime in the following code: 


self.W = W
self.b = bhid
self.b_prime = bvis
self.W_prime = self.W.T
self.theano_rng = theano_rng
if input is None:
    self.x = T.dmatrix(name='input')
else:
    self.x = input

self.params = [self.W, self.b, self.b_prime]


The preceding code sets b and b_prime to bhid and bvis respectively, while W_prime is set as the 
transpose of W; in other words, the weights are tied. Tied weights are sometimes, but not always, used 
in autoencoders for several reasons: 


• Tying weights improves the quality of results in several contexts (albeit often in contexts where 
the optimal solution is PCA, which is the solution an autoencoder with tied weights will tend to 
reach) 

• Tying weights reduces the memory consumption of the autoencoder by reducing the number of 
parameters that need be stored 

• Most importantly, tied weights provide a regularization effect; they require one less parameter to 
be optimized (thus one less thing that can go wrong!) 


However, in other contexts, 1t's both common and appropriate to use untied weights. This is true, for 
instance, in cases where the input data is multimodal and the optimal decoder models a nonlinear set 
of statistical regularities. In such cases, a linear model, such as PCA, will not effectively model the 
nonlinear trends and you will tend to obtain better results using untied weights. 
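
To make the memory argument for tied weights concrete, the following minimal sketch counts the stored parameters of a single autoencoder with and without tied weights. The helper name count_autoencoder_params and the layer sizes (784 visible, 500 hidden) are illustrative assumptions and are not part of this chapter's dA class.

```python
def count_autoencoder_params(n_visible, n_hidden, tied=True):
    """Count weights and biases in a one-hidden-layer autoencoder."""
    encoder_weights = n_visible * n_hidden
    # With tied weights the decoder reuses W transposed, so it stores no
    # additional weight matrix.
    decoder_weights = 0 if tied else n_hidden * n_visible
    biases = n_hidden + n_visible  # b (hidden) and b' (visible)
    return encoder_weights + decoder_weights + biases


tied = count_autoencoder_params(784, 500, tied=True)
untied = count_autoencoder_params(784, 500, tied=False)
print(tied, untied)
```

Under these assumed sizes, untying the weights adds one extra n_visible x n_hidden matrix, which is where almost all of the additional storage comes from.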


Having configured the parameters to our autoencoder, the next step is to define the functions that 
enable it to learn. Earlier in this chapter, we determined that autoencoders learn effectively by adding 
noise to input data, then attempting to learn an encoded representation of that input that can in turn be 
reconstructed into the input. What we need next, then, are functions that deliver this functionality. We 
begin by corrupting the input data: 


    def get_corrupted_input(self, input, corruption_level):
        return self.theano_rng.binomial(size=input.shape, n=1,
                                        p=1 - corruption_level,
                                        dtype=theano.config.floatX) * input


The degree of corruption is configurable using a corruption_level parameter; as we recognized
earlier, the corruption of the input through dropout typically does not exceed 50% of cases, or 0.5.
The function selects a random set of cases, where the proportion of the input selected is equal to
corruption_level. It does so by producing a corruption vector of 0s and 1s equal in length to the
input, where each entry is 0 with probability corruption_level, so that on average a
corruption_level-sized proportion of the vector is 0. The corrupted input vector is then simply the
element-wise product of the autoencoder's input vector and the corruption vector.
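
The masking step can be tried outside Theano. The following is a minimal NumPy sketch of the same idea, not the chapter's implementation; the function name corrupt_input, the all-ones test input, and the seed are assumptions made for illustration.

```python
import numpy as np


def corrupt_input(x, corruption_level, rng):
    """Zero out roughly `corruption_level` of the entries of x."""
    # Each mask entry is 1 with probability (1 - corruption_level), else 0.
    mask = rng.binomial(n=1, p=1.0 - corruption_level, size=x.shape)
    return mask * x


rng = np.random.RandomState(0)
x = np.ones((1000, 20))
x_tilde = corrupt_input(x, corruption_level=0.3, rng=rng)
# Because x is all ones, the mean of x_tilde is the fraction of entries kept.
print("fraction zeroed: %.3f" % (1.0 - x_tilde.mean()))
```

Because the mask is drawn at random, the fraction of zeroed entries only approximates corruption_level for any one batch.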


    def get_hidden_values(self, input):
        return T.nnet.sigmoid(T.dot(input, self.W) + self.b)


Next, we obtain the hidden values. This is done via code that performs the equation $y = s(Wx + b)$ to
obtain y (the hidden values). To get the autoencoder's output (z), we reconstruct the hidden layer via
code that uses the previously defined b_prime and W_prime to perform $z = s(W'y + b')$:


    def get_reconstructed_input(self, hidden):
        return T.nnet.sigmoid(T.dot(hidden, self.W_prime) +
                              self.b_prime)


The final missing piece is the calculation of cost updates. We reviewed one cost function previously,
a simple squared error measure: $E = \frac{1}{2}\lVert z - x \rVert^2$. Let's use this cost function to
calculate our cost updates, based on the input (x) and reconstruction (z):


    def get_cost_updates(self, corruption_level, learning_rate):
        tilde_x = self.get_corrupted_input(self.x, corruption_level)
        y = self.get_hidden_values(tilde_x)
        z = self.get_reconstructed_input(y)
        # Squared reconstruction error per example: E = 0.5 * ||z - x||^2
        E = 0.5 * T.sum((z - self.x) ** 2, axis=1)
        cost = T.mean(E)
        gparams = T.grad(cost, self.params)
        # One step of gradient descent for every parameter
        updates = [
            (param, param - learning_rate * gparam)
            for param, gparam in zip(self.params, gparams)
        ]
        return (cost, updates)
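
To see the corrupt, encode, decode, and score sequence end to end without a Theano installation, here is a minimal NumPy sketch of the forward pass and squared error cost for a tied-weight dA. It is a sketch under stated assumptions: the function names, array shapes, and seed are illustrative, and the gradient step (which T.grad derives symbolically in the listing above) is deliberately left out.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def reconstruction_cost(x, W, b, b_prime, corruption_level, rng):
    """Mean over examples of E = 0.5 * ||z - x||^2 for a tied-weight dA."""
    mask = rng.binomial(n=1, p=1.0 - corruption_level, size=x.shape)
    tilde_x = mask * x                       # corrupted input
    y = sigmoid(tilde_x.dot(W) + b)          # hidden values
    z = sigmoid(y.dot(W.T) + b_prime)        # reconstruction, W' = W transposed
    E = 0.5 * np.sum((z - x) ** 2, axis=1)   # scored against the clean input
    return E.mean()


rng = np.random.RandomState(0)
x = rng.uniform(size=(8, 6))                 # 8 examples, 6 visible units
W = rng.uniform(low=-0.1, high=0.1, size=(6, 3))
cost = reconstruction_cost(x, W, np.zeros(3), np.zeros(6), 0.3, rng)
print(cost)
```

Note that the reconstruction is compared with the uncorrupted input x, which is what forces the hidden layer to learn to undo the noise.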


At this point, we have a functional dA! It may be used to model nonlinear properties of input data and
can work as an effective tool to learn valid and lower-dimensional representations of input data.
However, the real power of autoencoders comes from the properties that they display when stacked
together, as the building blocks of a deep architecture.


Stacked Denoising Autoencoders


While autoencoders are valuable tools in themselves, significant gains in accuracy can be obtained by
stacking autoencoders to form a deep network. This is achieved by feeding the representation created
by the encoder on one layer into the next layer's encoder as the input to that layer.


Stacked denoising autoencoders (SdAs) are currently in use in many leading data science teams for
sophisticated natural language analyses as well as a hugely broad range of signals, image, and text
analysis.


The implementation of a SdA will be very familiar after the previous chapter's discussion of deep
belief networks. The SdA is used in much the same way as the RBMs in our deep belief networks
were used. Each layer of the deep architecture will have a dA and sigmoid component, with the
autoencoder component being used to pretrain the sigmoid network. The performance measure used
by a stacked denoising autoencoder is the training set error, with an intensive period of layer-to-layer
(layer-wise) pretraining used to gradually align network parameters before a final period of
fine-tuning. During fine-tuning, the network is trained using validation and test data, over fewer epochs but
with larger update steps. The goal is to have the network converge at the end of the fine-tuning in
order to deliver an accurate result.


In addition to delivering on the typical advantages of deep networks (the ability to learn feature 
representations for complex or high-dimensional datasets, and the ability to train a model without 
extensive feature engineering), stacked autoencoders have an additional, interesting property. 


Correctly configured stacked autoencoders can capture a hierarchical grouping of their input data.
Successive layers of a stacked denoising autoencoder may learn increasingly high-level features.
Where the first layer might learn some first-order features from input data (such as learning edges in a
photo image), a second layer may learn some grouping of first-order features (for instance, by
learning given configurations of edges that correspond to contours or structural elements in the input
image).


There's no golden rule to determine how many layers or how large layers should be for a given
problem. The best solution is usually to experiment with these model parameters until you find an
optimal point. This experimentation is best done with a hyperparameter optimization technique or
genetic algorithm (subjects we'll discuss in later chapters of this book).


Higher layers may learn increasingly high-order configurations, enabling a stacked denoising
autoencoder to learn to recognize facial features, alphanumerical characters, or generalized forms of
objects (such as a bird). This is what gives SdAs their unique capability to learn very sophisticated,
high-level abstractions of their input data.


Autoencoders can be stacked indefinitely, and it has been demonstrated that continuing to stack
autoencoders can improve the effectiveness of the deep architecture (with the main constraint
becoming compute cost in time). In this chapter, we'll look at stacking three autoencoders to solve a
natural language processing challenge.


Applying the SdA 


Now that we've had a chance to understand the advantages and power of the SdA as a deep learning 
architecture, let's test our skills on a real-world dataset. 


For this chapter, let's step away from image datasets and work with the OpinRank Review dataset, a 
text dataset of around 259,000 hotel reviews from TripAdvisor—accessible via the UCI machine 
learning dataset repository. This freely-available dataset provides review scores (as floating point 
numbers from 1 to 5) and review text for a broad range of hotels; we'll be applying our stacked dA to
attempt to identify the scoring of each hotel from its review text. 


Note 


We'll be applying our autoencoder to analyze a preprocessed version of this data, which is accessible 
from the GitHub share accompanying this chapter. We'll be discussing the techniques by which we 
prepare text data in an upcoming chapter. For the interested reader, the source data is available at
https://archive.ics.uci.edu/ml/datasets/OpinRank+Review+Dataset.


In order to get started, we're going to need a stacked denoising autoencoder (hereafter SdA) class:


class SdA(object):

    def __init__(
        self,
        numpy_rng,
        theano_rng=None,
        n_ins=200,
        hidden_layers_sizes=(500, 200),
        n_outs=5,
        corruption_levels=(0.1, 0.1)
    ):


As we previously discussed, the SdA is created by feeding the encoding from one layer's autoencoder
as the input to the subsequent layer. This class supports the configuration of the layer count (reflected
in, but not set by, the length of the hidden_layers_sizes and corruption_levels vectors). It also
supports differentiated layer sizes (in nodes) at each layer, which can be set using
hidden_layers_sizes. As we discussed, the ability to configure successive layers of the
autoencoder is critical to developing successful representations.


Next, we need parameters to store the MLP (self.sigmoid_layers) and dA (self.dA_layers)
elements of the SdA. In order to specify the depth of our architecture, we use the self.n_layers
parameter to specify the number of sigmoid and dA layers required:


        self.sigmoid_layers = []
        self.dA_layers = []
        self.params = []
        self.n_layers = len(hidden_layers_sizes)

        assert self.n_layers > 0


Next, we need to construct our sigmoid and dA layers. We begin by setting the input size of each layer,
which is taken either from the input vector size (for the first layer) or from the size of the preceding
hidden layer. Following this, sigmoid_layer and dA_layer components are created, with the dA layer
drawing from the dA class that we discussed earlier in this chapter:


        for i in xrange(self.n_layers):
            if i == 0:
                input_size = n_ins
            else:
                input_size = hidden_layers_sizes[i - 1]

            if i == 0:
                layer_input = self.x
            else:
                layer_input = self.sigmoid_layers[-1].output

            sigmoid_layer = HiddenLayer(rng=numpy_rng,
                                        input=layer_input,
                                        n_in=input_size,
                                        n_out=hidden_layers_sizes[i],
                                        activation=T.nnet.sigmoid)

            self.sigmoid_layers.append(sigmoid_layer)
            self.params.extend(sigmoid_layer.params)

            # The dA shares its weights and hidden bias with the sigmoid layer
            dA_layer = dA(numpy_rng=numpy_rng,
                          theano_rng=theano_rng,
                          input=layer_input,
                          n_visible=input_size,
                          n_hidden=hidden_layers_sizes[i],
                          W=sigmoid_layer.W,
                          bhid=sigmoid_layer.b)

            self.dA_layers.append(dA_layer)
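
The way layer sizes chain together in this loop can be checked in isolation. The following minimal sketch computes the (input size, output size) pair that each stacked layer would receive; the helper name layer_shapes and the example sizes are assumptions for illustration and do not appear in the SdA class itself.

```python
def layer_shapes(n_ins, hidden_layers_sizes):
    """Return the (n_in, n_out) pair for each stacked layer."""
    shapes = []
    for i, n_out in enumerate(hidden_layers_sizes):
        # The first layer reads the raw input; each later layer reads the
        # output of the layer beneath it.
        n_in = n_ins if i == 0 else hidden_layers_sizes[i - 1]
        shapes.append((n_in, n_out))
    return shapes


print(layer_shapes(200, (240, 170, 100)))
```

Each layer's output size becomes the next layer's input size, which is why only n_ins and hidden_layers_sizes need to be supplied.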


Having implemented the layers of our stacked dA, we'll need a final, logistic regression layer to 
complete the MLP component of the network: 


        self.logLayer = LogisticRegression(
            input=self.sigmoid_layers[-1].output,
            n_in=hidden_layers_sizes[-1],
            n_out=n_outs
        )

        self.params.extend(self.logLayer.params)
        self.finetune_cost = self.logLayer.negative_log_likelihood(self.y)
        self.errors = self.logLayer.errors(self.y)


This completes the architecture of our SdA. Next up, we need to generate the training functions used
by the SdA class. Each function will take the minibatch index (index) as an argument, together with
several other elements: the corruption_level and learning_rate are exposed here so that we
can adjust them (for example, gradually increase or decrease them) during training.


The ability to dynamically adjust the learning rate is particularly helpful and may be applied in
one of two ways. Once a technique has begun to converge on an appropriate solution, it is very
helpful to be able to reduce the learning rate. If you do not do this, you risk creating a situation in
which the network oscillates between values located around the optimum without ever hitting it. In
some contexts, it can be helpful to tie the learning rate to the network's performance measure. If the
error rate is high, it makes sense to make larger adjustments until the error rate begins to decrease!


Additionally, we define variables that identify where each batch starts and ends (batch_begin and
batch_end, respectively):


    def pretraining_functions(self, train_set_x, batch_size):
        index = T.lscalar('index')
        corruption_level = T.scalar('corruption')
        learning_rate = T.scalar('lr')
        batch_begin = index * batch_size
        batch_end = batch_begin + batch_size

        pretrain_fns = []
        for dA in self.dA_layers:
            cost, updates = dA.get_cost_updates(corruption_level,
                                                learning_rate)
            fn = theano.function(
                inputs=[
                    index,
                    theano.Param(corruption_level, default=0.2),
                    theano.Param(learning_rate, default=0.1)
                ],
                outputs=cost,
                updates=updates,
                givens={
                    self.x: train_set_x[batch_begin: batch_end]
                }
            )
            pretrain_fns.append(fn)

        return pretrain_fns


The pretraining functions that we've created take the minibatch index and can optionally take the
corruption level or learning rate. Each performs one step of pretraining, applying the weight updates
and returning the cost value.
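
Because the learning rate is an argument of each pretraining function, a schedule can be computed in plain Python and passed in at every call. The following is a minimal sketch of the two adjustment strategies described earlier (reducing the rate as training converges, and tying the rate to the error measure). The function names, the inverse-time decay form, and the constants are assumptions chosen for illustration; they are not taken from this chapter's code.

```python
def decayed_learning_rate(initial_lr, epoch, decay=0.05):
    """Shrink the step size as training proceeds (inverse-time decay)."""
    return initial_lr / (1.0 + decay * epoch)


def error_scaled_learning_rate(base_lr, error_rate, floor=0.001):
    """Take larger steps while the error rate is high, never below `floor`."""
    return max(floor, base_lr * error_rate)


for epoch in (0, 10, 20):
    print(epoch, decayed_learning_rate(0.1, epoch))
```

A value computed this way could then be supplied through the lr keyword of a pretraining function in place of a fixed constant.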


In addition to pretraining, we need to build functions to support the fine-tuning stage, wherein the
network is run iteratively over the validation and test data to optimize network parameters. The
training function (train_fn) seen in the code below implements a single step of fine-tuning. The
valid_score is a Python function that computes a validation score using the error measure produced
by the SdA over validation data. Similarly, test_score computes the error score over test data.


To get this process off the ground, we first need to set up training, validation, and test datasets. Each
stage requires two datasets (set_x and set_y) containing the features and class labels, respectively. The
required number of minibatches for validation and test is determined, and an index is created to track
the batch size (and provide a means of identifying at which entries a batch starts and ends). Training,
validation, and testing occur for each batch and afterward, both valid_score and test_score are
calculated across all batches:
    def build_finetune_functions(self, datasets, batch_size, learning_rate):
        (train_set_x, train_set_y) = datasets[0]
        (valid_set_x, valid_set_y) = datasets[1]
        (test_set_x, test_set_y) = datasets[2]

        n_valid_batches = valid_set_x.get_value(borrow=True).shape[0]
        n_valid_batches /= batch_size
        n_test_batches = test_set_x.get_value(borrow=True).shape[0]
        n_test_batches /= batch_size

        index = T.lscalar('index')
        gparams = T.grad(self.finetune_cost, self.params)
        updates = [
            (param, param - gparam * learning_rate)
            for param, gparam in zip(self.params, gparams)
        ]

        train_fn = theano.function(
            inputs=[index],
            outputs=self.finetune_cost,
            updates=updates,
            givens={
                self.x: train_set_x[
                    index * batch_size: (index + 1) * batch_size
                ],
                self.y: train_set_y[
                    index * batch_size: (index + 1) * batch_size
                ]
            },
            name='train'
        )

        test_score_i = theano.function(
            [index],
            self.errors,
            givens={
                self.x: test_set_x[
                    index * batch_size: (index + 1) * batch_size
                ],
                self.y: test_set_y[
                    index * batch_size: (index + 1) * batch_size
                ]
            },
            name='test'
        )

        valid_score_i = theano.function(
            [index],
            self.errors,
            givens={
                self.x: valid_set_x[
                    index * batch_size: (index + 1) * batch_size
                ],
                self.y: valid_set_y[
                    index * batch_size: (index + 1) * batch_size
                ]
            },
            name='valid'
        )

        def valid_score():
            return [valid_score_i(i) for i in xrange(n_valid_batches)]

        def test_score():
            return [test_score_i(i) for i in xrange(n_test_batches)]

        return train_fn, valid_score, test_score
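
The index arithmetic that selects each minibatch is the same in all three functions, and it can be checked on an ordinary Python list. The following is a minimal sketch; the helper names batch_bounds and n_full_batches and the toy data are assumptions for illustration only.

```python
def batch_bounds(index, batch_size):
    """Start and end row of minibatch `index` (the end is exclusive)."""
    return index * batch_size, (index + 1) * batch_size


def n_full_batches(n_examples, batch_size):
    """Number of complete minibatches; a trailing partial batch is dropped."""
    return n_examples // batch_size


data = list(range(10))
for i in range(n_full_batches(len(data), 4)):
    begin, end = batch_bounds(i, 4)
    print(i, data[begin:end])
```

With 10 examples and a batch size of 4, only two complete batches are produced, so the last two examples are never visited. This is the behavior that integer division of the dataset size by batch_size implies.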


With the training functionality in place, the following code initiates our stacked dA: 


numpy_rng = numpy.random.RandomState(89677)
print '... building the model'
sda = SdA(
    numpy_rng=numpy_rng,
    n_ins=200,
    hidden_layers_sizes=(240, 170, 100),
    n_outs=5
)

It should be noted that, at this point, we should be trying an initial configuration of layer sizes to see
how we do. In this case, the layer sizes used are the product of some initial testing. As we discussed,
training the SdA occurs in two stages. The first is a layer-wise pretraining process that loops over all
of the SdA's layers. The second is a process of fine-tuning over validation and test data.


To pretrain the SdA, we provide the required corruption levels to train each layer and iterate over the
layers using our previously defined pretraining_fns:


print '... getting the pretraining functions'
pretraining_fns = sda.pretraining_functions(train_set_x=train_set_x,
                                            batch_size=batch_size)

print '... pre-training the model'
start_time = time.clock()
corruption_levels = [.1, .2, .2]

for i in xrange(sda.n_layers):
    for epoch in xrange(pretraining_epochs):
        c = []
        for batch_index in xrange(n_train_batches):
            c.append(pretraining_fns[i](index=batch_index,
                                        corruption=corruption_levels[i],
                                        lr=pretrain_lr))
        print 'Pre-training layer %i, epoch %d, cost ' % (i, epoch),
        print numpy.mean(c)

end_time = time.clock()
# Python 2 print statement redirected to stderr, consistent with the prints above
print >> sys.stderr, ('The pretraining code for file ' +
                      os.path.split(__file__)[1] +
                      ' ran for %.2fm' % ((end_time - start_time) / 60.))


At this point, we're able to initialize our SdA class by calling the preceding code, which is stored within
this book's GitHub repository: MasteringMLWithPython/Chapter3/SdA.py


Assessing SdA performance


The SdA will take a significant length of time to run. With 15 epochs per layer and each epoch
typically taking an average of 11 minutes, the three-layer network will run for around 500 minutes on
a modern desktop system with GPU acceleration and a single-threaded GotoBLAS.


On a system without GPU acceleration, the network will take substantially longer to train, and it is
recommended that you use the alternative, which runs over a significantly smaller input dataset:
MasteringMLWithPython/Chapter3/SdA_no_blas.py


The results are of high quality, with a validation error score of 3.22% and test error score of 3.14%. 
These results are particularly impressive given the ambiguous and sometimes challenging nature of 
natural language processing applications. 


It was noticeable that the network classified more correctly for the 1-star and 5-star rating cases than
for the intermediate levels. This is largely due to the ambiguous nature of unpolarized or unemotional
language.


Part of the reason that this input data was classifiable is the significant feature engineering applied
to it. While time-consuming and sometimes problematic, we've seen that well-executed feature engineering
combined with an optimized model can deliver an excellent level of accuracy. In Chapter 6, Text
Feature Engineering, we'll be applying the techniques used to prepare this dataset ourselves.


Further reading


A well-informed overview of autoencoders (amongst other subjects) is provided by Quoc V. Le from
the Google Brain team. Read about it at https://cs.stanford.edu/~quocle/tutorial2.pdf.


This chapter used the Theano documentation available at http://deeplearning.net/tutorial/contents.html
as a base for discussion as Theano was the main library used in this chapter.


Summary 


In this chapter, we introduced the autoencoder, an effective dimensionality reduction technique with
some unique applications. We focused on the theory behind the stacked denoising autoencoder, an
extension of autoencoders whereby any number of autoencoders are stacked in a deep architecture.
We were able to apply the stacked denoising autoencoder to a challenging natural language processing
problem and met with great success, delivering highly accurate sentiment analysis of hotel reviews.


In the next chapter, we will discuss supervised deep learning methods, including Convolutional
Neural Networks (CNN).


Chapter 4. Convolutional Neural Networks


In this chapter, you'll be learning how to apply the convolutional neural network (also referred to as 
the CNN or convnet), perhaps the best-known deep architecture, via the following steps: 


• Taking a look at the convnet's topology and learning processes, including convolutional and
pooling layers

• Understanding how we can combine convnet components into successful network architectures

• Using Python code to apply a convnet architecture so as to solve a well-known image
classification task


Introducing the CNN


In the field of machine learning, there is an enduring preference for developing structures in code that
parallel biological structures. One of the most obvious examples is that of the MLP neural network, 
whose topology and learning processes are inspired by the neurons of the human brain. 


This preference has turned out to be highly efficient; the availability of specialized, optimized 
biological structures that excel at specific sets of tasks gives us a wealth of templates and clues from 
which to design and create effective learning models. 


The design of convolutional neural networks takes inspiration from the visual cortex, the area of the
brain that processes visual input. The visual cortex has several specializations that enable it to
effectively process visual data; it contains many receptor cells that detect light in overlapping regions
of the visual field. All receptor cells are subject to the same convolution operation, which is to say
that they all process their input in the same way. These specializations were incorporated into the
design of convnets, making their topology noticeably distinct from that of other neural networks.


It's safe to say that CNNs (convnets for short) are underpinning many of the most impactful current
advances in artificial intelligence and machine learning. Variants of CNNs are applied to some of the
most sophisticated visual, linguistic, and problem-solving applications in existence. Some examples
include the following:


• Google has developed a range of specialized convnet architectures, including GoogLeNet, a
22-layer convnet architecture. In addition, Google's DeepDream program, which became
well-known for its overtrained, hallucinogenic imagery, also uses a convolutional neural network.

• Convolutional nets have been taught to play the game Go (a long-standing AI challenge),
achieving win-rates ranging between 85% and 91% against highly-ranked players.

• Facebook uses convolutional nets in face verification (DeepFace).

• Baidu, Microsoft Research, IBM, and Twitter are among the many other teams using convnets to
tackle the challenges around trying to deliver next-generation intelligent applications.


In recent years, object recognition challenges, such as the 2014 ImageNet challenge, have been 
dominated by winners employing specialized convnet implementations or multiple-model ensembles 
that combine convnets with other architectures. 


While we'll cover how to create and effectively apply ensembles in Chapter 8, Ensemble Methods,
this chapter focuses on the successful application of convolutional neural networks to large-scale
visual classification contexts.


Understanding the convnet topology


The convolutional neural network's architecture should be fairly familiar; the network is an acyclic
graph composed of layers of increasingly few nodes, where each layer feeds into the next. This will 
be very familiar from many well-known network topologies such as the MLP. 


Perhaps the most immediate difference between a convolutional neural network and most other 
networks is that all of the neurons in a convnet are identical! All neurons possess the same parameters
and weight values. As you can see, this will immediately reduce the number of parameter values 
controlled by the network, bringing substantial efficiency savings. It also typically improves network 
learning rate as there are fewer free parameters to be managed and computed over. As we'll see later 
in this chapter, shared weights also enable a convnet to learn features irrespective of their position in 
the input (for example, the input image or audio signal). 
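
A rough count shows how large the saving from shared weights can be. The following minimal sketch compares a layer in which every node has its own weights with a layer that reuses a small set of filters at every position. The function names, the 32 x 32 RGB input, and the choice of 100 nodes or filters are assumptions made purely for illustration.

```python
def dense_layer_params(n_inputs, n_nodes):
    """Every node has its own weight per input, plus one bias."""
    return n_nodes * (n_inputs + 1)


def shared_filter_params(filter_height, filter_width, in_channels, n_filters):
    """Each filter is reused at every position, so the number of positions
    does not affect the parameter count."""
    return n_filters * (filter_height * filter_width * in_channels + 1)


# A 32 x 32 RGB image feeding 100 independent nodes,
# versus 100 shared 3 x 3 filters over the same image.
print(dense_layer_params(32 * 32 * 3, 100))
print(shared_filter_params(3, 3, 3, 100))
```

Under these assumptions the shared-weight layer stores roughly a hundred times fewer parameters, and the count stays the same however large the input image becomes.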


Another big difference between convolutional networks and other architectures is that the
connectivity between nodes is limited such as to develop a spatially local connectivity pattern. In
other words, the inputs to a given node will be limited to only those nodes whose receptor fields are
contiguous. This may be spatially contiguous, as in the case of image data; in such cases, each
neuron's inputs will ultimately draw from a continuous subset of the image. In the case of audio signal
data, the input might instead be a continuous window of time.


To illustrate this more clearly, let's take an example input image and discuss how a convolutional
network might process parts of that image across specific nodes. Nodes in the first layer of a
convolutional neural network will be assigned subsets of the input image. In this case, let's say that
they take a 3 x 3 pixel subset of the image each. These subsets cover the entire image without any
overlap between the areas taken as input by nodes and without any gaps. (Note that none of these
conditions are automatically true for convnet implementations.) Each node is assigned a 3 x 3 pixel
subset of the image (the receptive field of the node) and outputs a transformed version of that input.
We'll disregard the specifics of that transformation for now.


This output is usually then picked up by a second layer of nodes. In this case, let's say that our second
layer is taking a subset of all of the outputs from nodes in the first layer. For example, it might be
taking a contiguous 6 x 6 pixel subset of the original image; that is, it has a receptive field that covers
the outputs of exactly four nodes from the preceding layer. This becomes a little more intuitive when
explained visually:


Each layer is composable; the output of one convolutional layer may be fed into the next layer as an
input. This provides the same effect that we saw in Chapter 3, Stacked Denoising Autoencoders;
successive layers develop representations of increasingly high-level, abstract features. Furthermore,
as we build downward (adding layers), the representation becomes responsive to a larger region of
pixel space. Ultimately, by stacking layers, we can work our way toward global representations of the
entire input.
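
The growth of the receptive field with depth can be computed directly. The following minimal sketch tracks, layer by layer, how many input pixels a node ultimately draws from along one dimension. The function name receptive_field_sizes and the (filter size, stride) representation of a layer are assumptions made for illustration.

```python
def receptive_field_sizes(layers):
    """Receptive field width, in input pixels, after each layer.

    `layers` is a sequence of (filter_size, stride) pairs, first layer first.
    """
    sizes = []
    field, jump = 1, 1  # jump: input pixels between neighbouring nodes
    for filter_size, stride in layers:
        field += (filter_size - 1) * jump
        jump *= stride
        sizes.append(field)
    return sizes


# The example from the text: 3 x 3 tiles with no overlap (stride 3),
# then a second layer that reads a 2 x 2 block of first-layer nodes.
print(receptive_field_sizes([(3, 3), (2, 2)]))
```

For the example in the text this gives a width of 3 pixels after the first layer and 6 pixels after the second, matching the 3 x 3 and 6 x 6 regions described.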


Understanding convolution layers 


As described, in order to prevent each node from learning an unpredictable (and difficult to tune!) set
of very local, free parameters, weights in a layer are shared across the entire layer. To be completely
precise, the filters applied in a convolutional layer are a single set of filters, which are slid
(convolved) across the input dataset. This produces a two-dimensional activation map of the input,
which is referred to as the feature map.


The filter itself is subject to four hyperparameters: size, depth, stride, and zero-padding. The size of
the filter is fairly self-explanatory, being the area of the filter (obviously, found by multiplying height
and width; a filter need not be square!). Larger filters will tend to overlap more, and as we'll see, this
can improve the accuracy of classification. Crucially, however, increasing the filter size will create
increasingly large outputs. As we'll see, managing the size of outputs from convolutional layers is a
huge factor in controlling the efficiency of a network.


Depth defines the number of nodes in the layer that connect to the same region of the input. The trick
to understanding depth is to recognize that looking at an image (for people or networks) involves
processing multiple different types of property. Anyone who has ever looked at all the image
adjustment sliders in Photoshop has an idea of what this might entail. Depth is sometimes referred to
as a dimension in its own right; it almost relates to the complexity of an image, not in terms of its
contents but in terms of the number of channels needed to accurately describe it.


It's possible that the depth might describe color channels, with nodes mapped to recognize green,
blue, or red in the input. This, incidentally, leads to a common convention where depth is set to three
(particularly at the first convolution layer). It's very important to recognize that some nodes commonly
learn to express less easily-described properties of input images that happen to enable a convnet to
learn that image more accurately. Increasing the depth hyperparameter tends to enable nodes to
encode more information about inputs, with the attendant problems and benefits that you might expect.


As a result, setting the depth parameter to too small a value tends to lead to poor results because the
network doesn't have the expressive depth (in terms of channel count) required to accurately
characterize input data. This is a problem analogous to not having enough features, except that it's
more easily fixed; one can tune the depth of the network upward to improve the expressive depth of
the convnet.


Equally, setting the depth parameter to too large a value can be redundant or harmful to performance.
If in doubt, consider testing the appropriate depth value during network configuration via
hyperparameter optimization, the elbow method, or another technique.


Stride is a measure of spacing between neurons. A stride value of one will lead every element of the
input (for an image, potentially every pixel) to be the center of a filter instance. This naturally leads to
a high degree of overlap and very large outputs. Increasing the stride causes less of an overlap in the
receptive fields and the output's size is reduced. While tuning the stride of a convnet is a question of
weighing accuracy against output size, it can generally be a good idea to use smaller strides, which
tend to work better. In addition, a stride value of one enables us to manage down-sampling and scale
reduction at pooling layers (as we'll discuss later in the chapter).


The following diagram graphically displays both Depth and Stride:

[Figure: depth and stride of a convolutional layer; surviving label: Depth = 2]


The final hyperparameter, zero-padding, offers an interesting convenience. Zero-padding is the
process of setting the outer values (the border) of each receptive field to zero, which has the effect of
reducing the output size for that layer. It's possible to set one, or multiple, pixels around the border of
the field to zero, which reduces the output size accordingly. There are, of course, limits; obviously,
it's not a good idea to set zero-padding and stride such that areas of the input are not touched by a
filter! More generally, increasing the degree of zero-padding can cause a decrease in effectiveness,
which is tied to the increased difficulty of learning features via coarse coding. (Refer to the
Understanding pooling layers section in this chapter.)

However, zero-padding is very helpful because it enables us to adjust the input and output sizes to be 
the same. This is a very common practice; using zero-padding to ensure that the size of the input layer 
and output layer are equal, we are able to easily manage the stride and depth values. Without using 
zero-padding in this way, we would need to do a lot of work tracking input sizes and managing 
network parameters simply to make the network function correctly. In addition, zero-padding also 
improves performance as, without it, a convnet will tend to gradually degrade content at the edges of 
the filter. 


In order to calibrate the number of nodes, appropriate stride, and padding for successive layers when 
we define our convnet, we need to know the size of the output from the preceding layer. We can 
calculate the spatial size of a layer's output (O) as a function of the input image size (W), filter size 
(F), stride (S), and the amount of zero-padding applied (P), as follows: 

$$O = \frac{W - F + 2P}{S} + 1$$

If O is not an integer, the filters do not tile across the input neatly and instead extend over the edge of 
the input. This can cause some problematic issues when training (normally involving thrown 
exceptions)! By adjusting the stride value, one can find a whole-number solution for O and train 
effectively. It is normal for the stride to be constrained to what is possible given the other 
hyperparameter values and size of the input. 
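
The output-size calculation can be sketched as a small helper. This is only an illustration: the function name conv_output_size and its argument names are assumptions made here, not part of any library.

```python
def conv_output_size(w, f, s=1, p=0):
    """Spatial output size O = (W - F + 2P) / S + 1 for one dimension.

    Raises ValueError when the filter does not tile the padded input evenly,
    mirroring the non-integer case described in the text.
    """
    span = w - f + 2 * p
    if span < 0:
        raise ValueError("filter is larger than the padded input")
    if span % s != 0:
        raise ValueError("stride %d does not tile the input evenly" % s)
    return span // s + 1


print(conv_output_size(32, 5))            # 28: no padding shrinks the output
print(conv_output_size(32, 5, s=1, p=2))  # 32: padding of 2 preserves the size
```

With a 32 x 32 input and a 5 x 5 filter, a stride of 2 and no padding raises an error, because (32 - 5) / 2 is not a whole number.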


We've discussed the hyperparameters involved in correctly configuring the convolutional layer, but 
we haven't yet discussed the convolution process itself. Convolution is a mathematical operator, like 
addition or differentiation, which is heavily used in signal processing applications and in many other 
contexts where its application helps simplify complex equations. 


Loosely speaking, convolution is an operation over two functions that produces a third function 
that is a modified version of one of the two original functions. In the case of convolution within a 
convnet, the first component is the network's input. In the case of convolution applied to images, 
convolution is applied in two dimensions (the width and height of the image). The input image is 
typically three matrices of pixels, one for each of the red, green, and blue color channels, with 
values ranging between 0 and 255 in each channel. 


Note 


At this point, it's worth introducing the concept of a tensor. Tensor is a term commonly used to refer 
to an n-dimensional array or matrix of input data, commonly applied in deep learning contexts. It's 
effectively analogous to a matrix or array. We'll be discussing tensors in more detail, both in this 
chapter and in Chapter 9, Additional Python Machine Learning Tools (where we review the 
TensorFlow library). It's worth noting that the term tensor is seeing a resurgence of use in the 
machine learning community, largely through the influence of Google machine intelligence research 
teams. 


The second input to the convolution operation is the convolution kernel, a single matrix of floating 
point numbers that acts as a filter on the input matrices. The output of this convolution operation is the 
feature map. The convolution operation works by sliding the filter across the input, computing the dot 
product of the two arguments at each instance, which is written to the feature map. In cases where the 
stride of the convolutional layer is one, this operation will be performed across each pixel of the 
input image. 
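
The sliding dot product described above can be sketched in plain NumPy. This is a minimal, unoptimized illustration with an assumed function name (slide_filter). It does not flip the kernel, so strictly speaking it computes a cross-correlation, which is a common simplification in deep learning code.

```python
import numpy as np


def slide_filter(image, kernel, stride=1):
    """Slide `kernel` over a 2D `image` (no padding) and record the dot
    product of the kernel and the covered patch at each position."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    feature_map = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            r, c = i * stride, j * stride
            patch = image[r:r + kh, c:c + kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map


image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])  # responds to diagonal change
print(slide_filter(image, kernel))
```

Because every value in this toy image differs from its lower-right neighbour by the same amount, the resulting 4 x 4 feature map is constant.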


The main advantage of convolution is that it reduces the need for feature engineering. Creating and 
managing complex kernels and performing the highly specialized feature engineering processes 
needed is a demanding task, made more challenging by the fact that feature engineering processes that 
work well in one context can work poorly in most others. While we discuss feature engineering in 
detail in Chapter 7, Feature Engineering Part II, convolutional nets offer a powerful alternative. 


CNNs, however, incrementally improve their kernel's ability to filter a given input, thus automatically 
optimizing their kernel. This process is accelerated by learning multiple kernels in parallel at once. 
This is feature learning, which we've encountered in previous chapters. Feature learning can offer 
tremendous advantages in time and in increasing the accessibility of many problems. As with our 
earlier SDA and DBN implementations, we would look to pass our learned features to a much 
simpler, shallow neural network, which uses these features to classify the input image. 


Understanding pooling layers 


Stacking convolutional layers allows us to create a topology that effectively creates features as 
feature maps for complex, noisy input data. However, convolutional layers are not the only 
component of a deep network. It is common to weave convolutional layers in with pooling layers. 
Pooling is an operation over feature maps, where multiple feature values are aggregated into a single 
value—mostly using a max (max-pooling), mean (mean-pooling), or summation (sum-pooling) 
operation. 


Pooling is a fairly natural approach that offers substantial advantages. If we do not aggregate feature 
maps, we tend to find ourselves with a huge number of features. The CIFAR-10 dataset that we'll be 
classifying later in this chapter contains 60,000 32 x 32 pixel images. If we hypothetically learned 
200 features for each image (over 8 x 8 inputs), then at each convolution, we'd find ourselves with 
an output vector of size (32 - 8 + 1) * (32 - 8 + 1) * 200, or 125,000 features per image. Convolution 
produces a huge number of features that tend to make computation very expensive and can also 
introduce significant overfitting problems. 
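
The arithmetic above can be checked in a few lines. The 5 x 5 pooling region in the second half is a hypothetical choice made here purely to show the scale of the reduction; it is not taken from the text.

```python
image_size, patch_size, n_features = 32, 8, 200

# Number of valid positions for an 8 x 8 patch along one side of a 32 x 32 image.
positions = image_size - patch_size + 1
conv_features = positions * positions * n_features
print(conv_features)    # 125000

# Hypothetical non-overlapping 5 x 5 pooling over each 25 x 25 feature map.
pool = 5
pooled_features = (positions // pool) ** 2 * n_features
print(pooled_features)  # 5000
```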


The other major advantage provided by a pooling operation is that it provides a level of robustness 
against the many, small deviations and variances that occur in modeling noisy, high-dimensional data. 
Specifically, pooling prevents the network learning the position of features too specifically 
(overfitting), which is obviously a critical requirement in image processing and recognition settings. 
With pooling, the network no longer fixates on the precise location of features in the input and gains a 
greater ability to generalize. This is called translation-invariance. 


Max-pooling is the most commonly applied pooling operation. This is because it focuses on the most 
responsive features in question, which should, in theory, make it the best candidate for image recognition 
and classification purposes. By a similar logic, min-pooling tends to be applied in cases where it is 
necessary to take additional steps to prevent an overly sensitive classification or overfitting from 
occurring. 


For obvious reasons, it's prudent to begin modeling using a quickly applied and straightforward 
pooling method such as max-pooling. However, when seeking additional gains in network 
performance during later iterations, it's important to look at whether your pooling operations can be 
improved on. There isn't any real restriction in terms of defining your own pooling operation. Indeed, 
finding a more effective subsampling method or alternative aggregation can substantially improve the 
performance of your model. 


In terms of theano code, a max-pooling implementation is pretty straightforward and may look like 
this: 

import theano 
import theano.tensor as T 
from theano.tensor.signal import downsample 

input = T.dtensor4('input') 
maxpool_shape = (2, 2) 
pool_out = downsample.max_pool_2d(input, maxpool_shape, ignore_border=True) 
f = theano.function([input], pool_out) 


The max_pool_2d function takes an n-dimensional tensor and downscaling factor, in this case, 
input and maxpool_shape, with the latter being a tuple of length 2, containing width and height 
downscaling factors for the input image. The max_pool_2d operation then performs max-pooling 
over the two trailing dimensions of the tensor: 


invals = numpy.random.RandomState(1).rand(3, 2, 5, 5) 
pool_out = downsample.max_pool_2d(input, maxpool_shape, ignore_border=False) 
f = theano.function([input], pool_out) 


The ignore_border parameter determines whether the border values are considered or discarded. This 
max-pooling operation produces the following, given that ignore_border=True: 


[[ 0.72032449 0.39676747] 
[ 0.6852195 0.87811744] ] 
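
For readers without a working theano installation, the effect of the border setting can be approximated in plain NumPy. The helper below (max_pool_2d_numpy) is a hypothetical stand-in written for this illustration, not the theano implementation, and it only handles non-overlapping pools.

```python
import numpy as np


def max_pool_2d_numpy(x, pool=(2, 2), ignore_border=True):
    """Non-overlapping max-pooling over the last two axes of `x`."""
    ph, pw = pool
    h, w = x.shape[-2:]
    if ignore_border:
        oh, ow = h // ph, w // pw              # drop partial pools at the edge
    else:
        oh, ow = -(-h // ph), -(-w // pw)      # ceiling division keeps them
    out = np.empty(x.shape[:-2] + (oh, ow), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            window = x[..., i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
            out[..., i, j] = window.max(axis=(-2, -1))
    return out


invals = np.random.RandomState(1).rand(3, 2, 5, 5)
print(max_pool_2d_numpy(invals, ignore_border=True).shape)   # (3, 2, 2, 2)
print(max_pool_2d_numpy(invals, ignore_border=False).shape)  # (3, 2, 3, 3)
```

With a 5 x 5 input and 2 x 2 pools, discarding the border leaves a 2 x 2 output per feature map, while keeping it produces a 3 x 3 output whose last row and column come from partial pools.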


As you can see, pooling is a straightforward operation that can provide dramatic results (in this case, 
the input was a 5 x 5 matrix, reduced to 2 x 2). However, pooling is not without critics. In particular, 
Geoffrey Hinton offered this pretty delightful soundbite: 


"The pooling operation used in convolutional neural networks is a big mistake and the fact 
that it works so well is a disaster. 


If the pools do not overlap, pooling loses valuable information about where things are. We 
need this information to detect precise relationships between the parts of an object. Its true 
that if the pools overlap enough, the positions of features will be accurately preserved by 
"coarse coding" (see my paper on "distributed representations" in 1986 for an explanation of 
this effect). But I no longer believe that coarse coding is the best way to represent the poses 
of objects relative to the viewer (by pose I mean position, orientation, and scale)." 


This is a bold statement, but it makes sense. Hinton's telling us that the pooling operation, as an 
aggregation, does what any aggregation necessarily does: it reduces the data to a simpler and less 
informationally-rich format. This wouldn't be too damaging, except that Hinton goes further. 


Even if we'd reduced the data down to single values for each pool, we could still hope that the fact 
that multiple pools overlap spatially would still present feature encodings. (This is the coarse coding 
referred to by Hinton.) This is also quite an intuitive concept. Imagine that you're listening in to a 
signal on a noisy radio frequency. Even if you only caught one word in three, it's probable that you'd 
be able to distinguish a distress signal from the shipping forecast! 


However, Hinton follows up by observing that coarse coding is not as effective in learning pose 
(position, orientation, and scale). There are so many permutations in viewpoint relative to an object 
that it's unlikely two images would be alike and the sheer variety of possible poses becomes a 
challenge for a convolutional network using pooling. This suggests that an architecture that does not 
overcome this challenge may not be able to break past an upper limit for image classification. 


However, the general consensus, at least for now, is that even after acknowledging all of this, it is 
still highly advantageous in terms of efficiency and translation-invariance to continue using pooling 
operations in convnets. Right now, the argument goes that it's the best we have! 


Meanwhile, Hinton proposed an alternative to convnets in the form of the transforming autoencoder. 
The transforming autoencoder offers accuracy improvements on learning tasks that require a high 
level of precision (such as facial recognition), where pooling operations would cause a reduction in 
precision. The Further reading section of this chapter contains recommendations if you are interested 
in learning more about the transforming autoencoder. 


So, we've spent quite a bit of time digging into the convolutional neural network—its components, 
how they work, and their hyperparameters. Before we move on to put the theory into action, it's worth 
discussing how all of these theoretical components fit together into a working architecture. To do this, 
let's discuss what training a convnet looks like. 


Training a convnet 


The means of training a convolutional network will be familiar to readers of the preceding chapters. 
The convolutional architecture itself is used to pretrain a simpler network structure (for example, an 
MLP). The backpropagation algorithm is the standard method to compute the gradient when 
pretraining. During this process, every layer undertakes three tasks: 


• Forward pass: Each feature map is computed as a sum of all feature maps convolved with the 
corresponding weight kernel 

• Backward pass: The gradients respective to inputs are calculated by convolving the transposed 
weight kernel with the gradients, with respect to the outputs 

• The loss for each kernel is calculated, enabling the individual weight adjustment of every kernel 
as needed 
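

The three tasks listed above can be sketched for a single 2D kernel in NumPy. This is a simplified illustration under stated assumptions: one input map, one kernel, no bias or pooling, a mean squared error loss, and hypothetical function names (forward, backward, mse). Accumulating the kernel into grad_x at each output position is equivalent to convolving the flipped kernel with the output gradients.

```python
import numpy as np


def forward(x, w):
    """Forward pass: valid sliding-window filter of input `x` with kernel `w`."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    y = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            y[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return y


def backward(x, w, grad_y):
    """Backward pass: gradients with respect to the input and the kernel."""
    kh, kw = w.shape
    grad_x = np.zeros_like(x)
    grad_w = np.zeros_like(w)
    for i in range(grad_y.shape[0]):
        for j in range(grad_y.shape[1]):
            grad_x[i:i + kh, j:j + kw] += grad_y[i, j] * w
            grad_w += grad_y[i, j] * x[i:i + kh, j:j + kw]
    return grad_x, grad_w


def mse(y, target):
    return np.mean((y - target) ** 2)


rng = np.random.RandomState(0)
x, w = rng.rand(6, 6), rng.rand(3, 3)
target = rng.rand(4, 4)            # arbitrary target feature map

initial_loss = mse(forward(x, w), target)
for step in range(200):
    y = forward(x, w)
    grad_y = 2.0 * (y - target) / y.size   # gradient of the MSE loss
    _, grad_w = backward(x, w, grad_y)
    w -= 0.1 * grad_w                      # adjust the kernel weights
final_loss = mse(forward(x, w), target)
print(initial_loss, final_loss)
```

Repeating the loop drives the loss down, which is the convergence behaviour the text goes on to describe.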


Repetition of this process allows us to achieve increasing kernel performance until we reach a point 
of convergence. At this point, we will hope to have developed a set of features sufficient that the 
capping network is able to effectively classify over these features. 


This process can execute slowly, even on a fairly advanced GPU. Some recent developments have 
helped accelerate the training process, including the use of the Fast Fourier Transform to accelerate 
the convolution process (for cases where the convolution kernel is of roughly equal size to the input 
image). 


Putting it all together 


So far, we've discussed some of the elements required to create a CNN. The next subject of 
discussion should be how we go about combining these components to create capable convolutional 
nets as well as which combinations of components can work well. We'll draw guidance from a 
number of forerunning convnet implementations as we build an understanding of what is commonly 
done as well as what is possible. 


Probably the best-known convolutional network implementation is Yann LeCun's LeNet. LeNet has 
gone through several iterations since LeNet-1 in the late 1980s, but has been increasingly effective at 
performing tasks including handwritten digit and image classification. LeNet is structured using 
alternating convolution and pooling layers capped by an MLP, as follows: 


[Figure: LeNet architecture: input layer, convolution layer (C1, 4 feature maps), sub-sampling layer 
(S1, 4 feature maps), convolution layer (C2, 6 feature maps), sub-sampling layer (S2, 6 feature maps), 
fully connected MLP] 


Each layer is partially-connected, as we discussed earlier, with the MLP being a fully connected 
layer. At each layer, multiple feature maps (channels) are employed; this gives us the advantage of 
being able to create more complex sets of filters. As we'll see, using multiple channels within a layer 
is a powerful technique employed in advanced use cases. 


It's common to use max-pooling layers to reduce the dimensionality of the output to match the input as 
well as generally manage output volumes. How pooling is implemented, particularly in regard to the 
relative position of convolutional and pooling layers, is an element that tends to vary between 
implementations. It's generally common to develop a layer as a set of operations that feed into, and 
are fed into, a single Fully Connected layer, as shown in the following example: 


[Figure: an example layer: Previous Layer, 3x3 Max Pooling, Fully Connected] 


While this network structure wouldn't work in practice, it's a helpful illustration of the fact that a 
network can be constructed from the components you've learned about in a number of ways. How this 
network is structured and how complex it becomes should be motivated by the challenge the network 
is intended to solve. Different problems can call for very different solutions. 


In the case of the LeNet implementation that we'll be working with later in this chapter, each layer 
contains multiple convolutional layers in parallel with a max-pooling layer following each. 
Diagrammatically, a LeNet layer looks like the following image: 


[Figure: a LeNet layer: Previous Layer, three parallel 4x4 Convolutions, 2x2 Max Pooling, Fully Connected] 


This architecture will enable us to start looking at some initial use cases quickly and easily, but in 
general won't perform well for some of the state-of-the-art applications we'll run into later in this 
book. Given this fact, there are some more extensive deep learning architectures designed to tackle 
the most challenging problems, whose topologies are worth discussing. One of the best-known 
convnet architectures is Google's Inception network, now more commonly known as GoogLeNet. 


GoogLeNet was designed to tackle computer vision challenges involving Internet-quality image data, 
that is, images that have been captured in real contexts where the pose, lighting, occlusion, and clutter 
of images vary significantly. GoogLeNet was applied to the 2014 ImageNet challenge with 
noteworthy success, achieving only 6.7% error rate on the test dataset. ImageNet images are small, 
high-granularity images taken from many, varied classes. Multiple classes may appear very similar 
(such as varieties of tree) and the network architecture must be able to find increasingly challenging 
class distinctions to succeed. For a concrete example, consider the following ImageNet image: 





Given the demands of this problem, the GoogLeNet architecture used to win ImageNet 14 departs 
from the LeNet model in several key ways. GoogLeNet's basic layer design is known as the Inception 
module and is made up of the following components: 


The 1 x 1 convolutional layers used here are followed by Rectified Linear Units (ReLU). This 
approach is heavily used in speech and audio modeling contexts as ReLU can be used to effectively 
train deep models without pretraining and without facing some of the gradient vanishing problems that 
challenge other activation types. More information on ReLU is provided in the Further reading 
section of this chapter. The DepthConcat element provides a concatenation function, which 
consolidates the outputs of multiple units and substantially improves training time. 
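

As a rough illustration of the two operations just mentioned, the sketch below applies a ReLU to the outputs of several parallel branches and concatenates them along the channel (depth) axis. The branch channel counts and array names are illustrative assumptions, and random arrays stand in for real convolution outputs.

```python
import numpy as np


def relu(x):
    """Rectified Linear Unit: negative activations are clipped to zero."""
    return np.maximum(x, 0.0)


rng = np.random.RandomState(0)
batch, height, width = 2, 28, 28
branch_channels = [64, 128, 32, 32]   # assumed sizes of four parallel branches

# Each branch yields a (batch, channels, height, width) tensor.
branches = [relu(rng.randn(batch, c, height, width)) for c in branch_channels]

# Depth concatenation: stack the branches along the channel axis.
merged = np.concatenate(branches, axis=1)
print(merged.shape)  # (2, 256, 28, 28)
```

Concatenation only works because every branch shares the same batch size, height, and width; only the channel counts differ.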


GoogLeNet chains layers of this type to create a full network. Indeed, the repetition of inception 
modules through GoogLeNet (nine times!) suggests that Network In Network (NIN) (deep 
architectures created from chained network modules) approaches are going to continue to be a serious 
contender in deep learning circles. The paper describing GoogLeNet and demonstrating how 
inception models were integrated into the network is provided in the Further reading section of this 
chapter. 


Beyond the regularity of Inception module stacking, GoogLeNet has a few further surprises to throw 
at us. The first few layers are typically more straightforward with single-channel convolutional and 
max-pooling layers used at first. Additionally, at several points, GoogLeNet introduced a branch off 
the main structure using an average-pool layer, feeding into auxiliary softmax classifiers. The purpose 
of these classifiers was to improve the gradient signal that gets propagated back in lower layers of the 
network, enabling stronger performance at the early and middle network layers. Instead of one huge 
and potentially vague backpropagation process stemming from the final layer of the network, 
GoogLeNet instead has several intermediary update sources. 


What's really important to take from this implementation is that GoogLeNet and other top convnet 
architectures are mainly successful because they are able to find effective configurations using the 
highly available components that we've discussed in this chapter. Now that we've had a chance to 
discuss the architecture and components of a convolutional net and the opportunity to discuss how 
these components are used to construct some highly advanced networks, it's time to apply the 
techniques to solve a problem of our own! 


Applying a CNN 


We'll be working with image data to try out our convnet. The image data that we worked with in 
earlier chapters, including the MNIST digit dataset, was a useful training dataset (with many valuable 
real-world applications such as automated check reading!). However, it differs from almost all 
photographic or video data in an important way; most visual data is highly noisy. 


Problem variables can include pose, lighting, occlusion, and clutter, which may be expressed 
independently or in conjunction in huge variety. This means that the task of creating a function that is 
invariant to all properties of noise in the dataset is challenging; the function is typically very complex 
and nonlinear. In Chapter 7, Feature Engineering Part II, we'll discuss how techniques such as 
whitening can help mitigate some of these challenges, but as we'll see, even such techniques by 
themselves are insufficient to yield good classification (at least, without a very large investment of 
time!). By far, the most efficient solution to the problem of noise in image data, as we've already seen 
in multiple contexts, is to use a deep architecture rather than a broad one (that is, a neural network 
with few, high-dimensional layers, which is vulnerable to problematic overfitting and generalizability 
problems). 


From discussions in previous chapters, the reasons for a deep architecture may already be clear; 
successive layers of a deep architecture reuse the reasoning and computation performed in preceding 
layers. Deep architectures can thus build a representation that is sequentially improved by successive 
layers of the network without performing extensive recalculation on any individual layer. This makes 
the challenging task of classifying large datasets of noisy photograph data achievable to a high level 
of accuracy in a relatively short time, without extensive feature engineering. 


Now that we've discussed the challenges of modeling image data and advantages of a deep 
architecture in such contexts, let's apply a convnet to a real-world classification problem. 


As in preceding chapters, we're going to start out with a toy example, which we'll use to familiarize 
ourselves with the architecture of our deep network. This time, we're going to take on a classic image 
processing challenge, CIFAR-10. CIFAR-10 is a dataset of 60,000 32 x 32 color images in 10 
classes, with each class containing 6,000 images. The data is already split into five training batches, 
with one test batch. The classes and some images from each dataset are as follows: 


While the industry has, to an extent, moved on to tackle other datasets such as ImageNet, CIFAR-10 
was long regarded as the bar to reach in terms of image classification, with a great many data 
scientists attempting to create architectures that classify the dataset to human levels of accuracy, 
where human error rate is estimated at around 6%. 


In November 2014, Kaggle ran a contest whose objective was to classify CIFAR-10 as accurately as 
possible. This contest's highest-scoring entry produced 95.55% classification accuracy, with the 
result using convolutional networks and a Network-in-Network approach. We'll discuss the challenge 
of classifying this dataset, as well as some of the more advanced techniques we can bring to bear, in 
Chapter 8, Ensemble Methods; for now, let's begin by having a go at classification with a 
convolutional network. 


For our first attempt, we'll apply a fairly simple convolutional network with the following objectives: 


• Applying a filter to the image and viewing the output 
• Seeing the weights that our convnet created 
• Understanding the difference between the outputs of effective and ineffective networks 


In this chapter, we're going to take an approach that we haven't taken before, which will be of huge 
importance to you when you come to use these techniques in the wild. We saw earlier in this chapter 
how the deep architectures developed to solve different problems may differ structurally in many 
ways. 


It's important to be able to create problem-specific network architectures so that we can adapt our 
implementation to fit a range of real-world problems. To do this, we'll be constructing our network 
using components that are modular and can be recombined in almost any way necessary, without too 
much additional effort. We saw the impact of modularity earlier in this chapter, and it's worth 
exploring how to apply this effect to our own networks. 


As we discussed earlier in the chapter, convnets become particularly powerful when tasked to 
classify very large and varied datasets of up to tens or hundreds of thousands of images. As such, let's 
be a little ambitious and see whether we can apply a convnet to classify CIFAR-10. 


In setting up our convolutional network, we'll begin by defining a useable class and initializing the 
relevant network parameters, particularly weights and biases. This approach will be familiar to 
readers of the preceding chapters. 


class LeNetConvPoolLayer(object): 

    def __init__(self, rng, input, filter_shape, image_shape, 
                 poolsize=(2, 2)): 

        assert image_shape[1] == filter_shape[1] 
        self.input = input 

        fan_in = numpy.prod(filter_shape[1:]) 
        fan_out = (filter_shape[0] * numpy.prod(filter_shape[2:]) / 
                   numpy.prod(poolsize)) 

        W_bound = numpy.sqrt(6. / (fan_in + fan_out)) 
        self.W = theano.shared( 
            numpy.asarray( 
                rng.uniform(low=-W_bound, high=W_bound, 
                            size=filter_shape), 
                dtype=theano.config.floatX 
            ), 
            borrow=True 
        ) 


Before moving on to create the biases, it's worth reviewing what we have thus far. The 
LeNetConvPoolLayer class is intended to implement one full convolutional and pooling layer as per 
the LeNet layer structure. This class contains several useful initial parameters. 


From previous chapters, we're familiar with the rng parameter used to initialize weights to random 
values. We can also recognize the input parameter. As in most cases, image input tends to take the 
form of a symbolic image tensor. This image input is shaped by the image_shape parameter; this is a 
tuple or list of length 4 describing the dimensions of the input. As we move through successive layers, 
image_shape will reduce increasingly. As a tuple, the dimensions of image_shape simply specify 
the height and width of the input. As a list of length 4, the parameters, in order, are as follows: 


• The batch size 
• The number of input feature maps 
• The height of the input image 
• The width of the input image 


While image_shape specifies the size of the input, filter_shape specifies the dimensions of the 
filter. As a list of length 4, the parameters, in order, are as follows: 


• The number of filters (channels) to be applied 
• The number of input feature maps 
• The height of the filter 
• The width of the filter 


However, the height and width may be entered without any additional parameters. The final parameter 
here, poolsize, describes the downsizing factor. This is expressed as a list of length 2, the first 
element being the number of rows and the second—the number of columns. 


Having defined these values, we immediately apply them to define the LeNetConvPoolLayer class 
better. In defining fan_in, we set the number of inputs to each hidden unit to be the product of the 
number of input feature maps and the filter height and width. Simply enough, we also define fan_out, 
the number of units from which each lower-layer unit receives a gradient, which is calculated as the 
product of the number of output feature maps and the filter height and width, divided by the pooling 
size. 
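

To make these quantities concrete, the following sketch computes fan_in, fan_out, and the resulting weight bound using NumPy only. The filter_shape value of (20, 1, 5, 5) and the random seed are assumptions chosen for illustration.

```python
import numpy as np

filter_shape = (20, 1, 5, 5)   # assumed: 20 filters, 1 input map, 5 x 5 kernels
poolsize = (2, 2)

# Inputs to each hidden unit: input feature maps * filter height * filter width.
fan_in = np.prod(filter_shape[1:])

# Output feature maps * filter height * filter width / pooling size.
fan_out = filter_shape[0] * np.prod(filter_shape[2:]) / np.prod(poolsize)

# Weights are drawn uniformly from [-w_bound, w_bound].
w_bound = np.sqrt(6.0 / (fan_in + fan_out))

rng = np.random.RandomState(1234)
W = rng.uniform(low=-w_bound, high=w_bound, size=filter_shape)
print(fan_in, fan_out, w_bound, W.shape)
```

For these assumed shapes, fan_in is 25, fan_out is 125, and the bound works out to 0.2, so larger layers start with proportionally smaller weights.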


Next, we move on to defining the bias as a set of one-dimensional tensors, one for each output feature 
map: 


        b_values = numpy.zeros((filter_shape[0],), 
                               dtype=theano.config.floatX) 
        self.b = theano.shared(value=b_values, borrow=True) 

        conv_out = conv.conv2d( 
            input=input, 
            filters=self.W, 
            filter_shape=filter_shape, 
            image_shape=image_shape 
        ) 


With this single function call, we've defined a convolution operation that uses the filters we 
previously defined. At times, it can be a little staggering to see how much theory needs to be known to 
effectively apply a single function! The next step is to create a similar pooling operation using 
max_pool_2d: 


        pooled_out = downsample.max_pool_2d( 
            input=conv_out, 
            ds=poolsize, 
            ignore_border=True 
        ) 

        self.output = T.tanh(pooled_out + self.b.dimshuffle('x', 0, 'x', 'x')) 

        self.params = [self.W, self.b] 
        self.input = input 


Finally, we add the bias term, first reshaping it to be a tensor of shape (1, n_filters, 1, 1). This has 
the simple effect of causing the bias to affect every feature map and minibatch. At this point, we have 
all of the components we need to build a basic convnet. Let's move on to create our own network: 


x = T.matrix('x') 
y = T.ivector('y') 
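

Before building the layers, the bias reshaping used inside LeNetConvPoolLayer can be checked in isolation. The sketch below uses NumPy broadcasting as a stand-in for theano's dimshuffle; the array sizes and bias values are arbitrary assumptions.

```python
import numpy as np

pooled_out = np.zeros((4, 3, 2, 2))   # (batch, n_filters, height, width)
b = np.array([0.1, 0.2, 0.3])         # one bias value per feature map

# Reshape the bias to (1, n_filters, 1, 1) so that it broadcasts across the
# batch, height, and width axes, then apply the tanh activation.
output = np.tanh(pooled_out + b.reshape(1, -1, 1, 1))
print(output.shape)  # (4, 3, 2, 2)
```

Every position in a given feature map receives the same bias, regardless of which minibatch example it belongs to.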


This process is fairly simple. We build the layers in order, passing parameters to the class that we 
previously specified. Let's start by building our first layer: 


layer0_input = x.reshape((batch_size, 1, 32, 32)) 

layer0 = LeNetConvPoolLayer( 
    rng, 
    input=layer0_input, 
    image_shape=(batch_size, 1, 32, 32), 
    filter_shape=(nkerns[0], 1, 5, 5), 
    poolsize=(2, 2) 
) 


We begin by reshaping the input to spread it across all of the intended minibatches. As the CIFAR-10 
images are of a 32 x 32 dimension, we've used this input size for the height and width dimensions. 
The filtering process reduces the size of this input to 32 - 5 + 1 in each dimension, or 28. Pooling 
reduces this by half in each dimension to create an output layer of shape (batch_size, nkerns[0], 
14, 14). 


This is a completed first layer. Next, we can attach a second layer to this using the same code: 


layer1 = LeNetConvPoolLayer( 
    rng, 
    input=layer0.output, 
    image_shape=(batch_size, nkerns[0], 14, 14), 
    filter_shape=(nkerns[1], nkerns[0], 5, 5), 
    poolsize=(2, 2) 
) 


As per the previous layer, the output shape for this layer is (batch_size, nkerns[1], 5, 5). So 
far, so good! Let's feed this output to the next, fully-connected sigmoid layer. To begin with, we need 
to flatten the input shape to two dimensions. With the values that we've fed to the network so far, the 
input will be a matrix of shape (500, 1250). As such, we'll set up an appropriate layer2: 


layer2_input = layer1.output.flatten(2) 

layer2 = HiddenLayer( 
    rng, 
    input=layer2_input, 
    n_in=nkerns[1] * 5 * 5, 
    n_out=500, 
    activation=T.tanh 
) 
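

The layer shapes quoted in this section can be checked with a few lines of arithmetic. In this sketch, batch_size = 500 and nkerns[1] = 50 are inferred from the (500, 1250) matrix mentioned above, while nkerns[0] = 20 and the helper name conv_pool_size are assumptions made for illustration.

```python
def conv_pool_size(size, filter_size=5, pool=2):
    """Side length after a valid convolution followed by non-overlapping pooling."""
    conv = size - filter_size + 1
    return conv // pool


batch_size = 500
nkerns = (20, 50)                  # nkerns[0] is an assumed value

size0 = conv_pool_size(32)         # 32 -> 28 -> 14
size1 = conv_pool_size(size0)      # 14 -> 10 -> 5

layer0_out = (batch_size, nkerns[0], size0, size0)
layer1_out = (batch_size, nkerns[1], size1, size1)
layer2_in = (batch_size, nkerns[1] * size1 * size1)

print(layer0_out)  # (500, 20, 14, 14)
print(layer1_out)  # (500, 50, 5, 5)
print(layer2_in)   # (500, 1250)
```

Tracking shapes this way before wiring up the layers makes it easier to catch a mismatched n_in value early.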


This leaves us in a good place to finish this network's architecture, by adding a final, logistic 
regression layer that calculates the values of the fully-connected sigmoid layer. 


Let's try out this code: 


x = T.matrix('CIFAR-10_train') 
y = T.ivector('CIFAR-10_test') 


Chapter 4/convolutional_mlp.py 


The results that we obtained were as follows: 


Optimization complete. 

Best validation score of 0.885725 % obtained at iteration 17400, with test 
performance 0.902508 % 

The code for file convolutional_mlp.py ran for 26.50m 


This accuracy score, at validation, is reasonably good. It's not at a human level of accuracy, which, as 
we established, is roughly 94%. Equally, it is not the best score that we could achieve with a convnet. 


For instance, the Further Reading section of this chapter refers to a convnet implemented in Torch 
using a combination of dropout (which we studied in Chapter 3, Stacked Denoising Autoencoders) 
and Batch Normalization (a normalization technique intended to reduce covariate shift during the 
training process; refer to the Further Reading section for further technical notes and papers on this 
technique), which scored 92.45% validation accuracy. 


A score of 88.57% is, however, in the same ballpark and can give us confidence that we're within 
striking distance of an effective network architecture for the CIFAR-10 problem. More importantly, 
you've learned a lot about how to configure and train a convolutional neural network effectively. 


Further Reading 


The glut of recent interest in Convolutional Networks means that we're spoiled for choice for further 
reading. One good option for an unfamiliar reader is the course notes from Andrej Karpathy's course: 


http://cs231n.github.io/convolutional-networks/. 


For readers with an interest in the deeper details of specific best-in-class implementations, some of 
the networks referenced in this chapter were the following: 


Google's GoogLeNet (http://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf) 


Google Deepmind's Go-playing program AlphaGo 
(https://gogameguru.com/i/2016/03/deepmind-mastering-go.pdf) 


Facebook's DeepFace architecture for facial recognition 
(https://www.cs.toronto.edu/~ranzato/publications/taigman_cvpr14.pdf) 


The ImageNet LSVRC-2010 contest winning network, described here by Krizhevsky, Sutskever and 
Hinton (http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf) 


Finally, Sergey Zagoruyko's Torch implementation of a ConvNet with Batch Normalization is
available here: http://torch.ch/blog/2015/07/30/cifar.html.


Summary


In this chapter, we covered a lot of ground. We began by introducing a new kind of neural network,
the convnet. We explored the theory and architecture of a convnet in its most ubiquitous form and
also discussed some state-of-the-art network design principles that were being developed as recently
as mid-2015 in organizations such as Google and Baidu. We built an understanding of the topology
and also of how the network operates.


Following this, we began to work with the convnet itself, applying it to the CIFAR-10 dataset. We 
used modular convnet code to create a functional architecture that reached a reasonable level of 
accuracy in classifying 10-class image data. While we're definitely still at some distance from human 
levels of accuracy, we're gradually closing the gap! Chapter 8, Ensemble Methods, will pick up from
what you learned here, taking these techniques and their application to the next level.


Chapter 5. Semi-Supervised Learning



Introduction 


In previous chapters, we've tackled a range of data challenges using advanced techniques. In each 
case, we've applied our techniques to datasets with reasonable success. 


In many regards, though, we've had it pretty easy. Our data has been largely derived from canonical 
and well-prepared sources so we haven't had to do a great deal of preparation. In the real world, 
though, there are few datasets like this (except, perhaps, the ones that we're able to specify 
ourselves!). In particular, it is rare to come across a dataset in the wild that has
class labels available. Without labels on a sufficient portion of the dataset, we find ourselves unable 
to build a classifier that can accurately predict labels on validation or test data. So, what do we do? 


The common solution is to attempt to tag our data manually; not only is this time-consuming, but it
also suffers from certain types of human error (which are especially common with high-dimensional
datasets, where a human observer is unable to identify class boundaries as well as a computational
approach might).


A fairly new and quite exciting alternative approach is to use semi-supervised learning to apply 
labels to unlabeled data via capturing the shape of underlying distributions. Semi-supervised learning 
has been growing in popularity over the last decade for its ability to save large amounts of annotation
time, where annotation, if possible at all, may require human expertise or specialist equipment.
Contexts where this has proven to be particularly valuable have been natural language parsing and
speech signal analysis; in both areas, manual annotation has proven to be complex and
time-consuming.


In this chapter, you're going to learn how to apply several semi-supervised learning techniques, 
including Contrastive Pessimistic Likelihood Estimation (CPLE), self-training, and S3VM. These
techniques will enable us to label training data in a range of otherwise problematic contexts. You'll
learn to identify the capabilities and limitations of semi-supervised techniques. We'll use a number of 
recent Python libraries developed on top of scikit-learn to apply semi-supervised techniques to 
several use cases, including audio signal data. 


Let's get started!


Understanding semi-supervised learning


The most persistent cost in performing machine learning is the creation of tagged data for training
purposes. Datasets tend not to come with class labels provided due to the circularity of the situation;
one needs a trained classification technique to generate class labels, but cannot train the technique
without labeled training and test data. As mentioned, tagging data manually or via test processes is
one option, but this can be prohibitively time-consuming, costly (particularly for medical tests), 
challenging to organize, and prone to error (with large or complex datasets). Semi-supervised 
techniques suggest a better way to break this deadlock. 


Semi-supervised learning techniques use both unlabeled and labeled data to create better learners
than can be created with either unlabeled or labeled data individually. They form a family of
techniques that sits in the space between supervised learning (with labeled data) and unsupervised
learning (with unlabeled data).


The main types of technique that exist in this group are semi-supervised techniques, transductive 
techniques, and active learning techniques, as well as a broad set of other methods. 


Semi-supervised techniques leave a set of test data out of the training process so as to perform testing 
at a later stage. Transductive techniques, meanwhile, are purely intended to develop labels for 
unlabeled data. There may not be a test process embedded in a transductive technique and there may 
not be labeled data available for use. 


In this chapter, we'll focus on a set of semi-supervised techniques that deliver powerful dataset 
labeling capability in very familiar formats. A lot of the techniques that we'll be discussing are
usable as wrappers around familiar, pre-existing classifiers, from logistic regression classifiers to
SVMs. As such, many of them can be run using estimators from scikit-learn. We'll begin by applying a
logistic regression classifier to test cases before moving on to apply an SVM with semi-supervised
extensions.


Semi-supervised algorithms in action


We've discussed what semi-supervised learning is, why we want to engage in it, and what some of the 
general realities of employing semi-supervised algorithms are. We've gone about as far as we can 
with general descriptions. Over the next few pages, we'll move from this general understanding to 
develop an ability to use a semi-supervised application effectively.


Self-training


Self-training is the simplest semi-supervised learning method and can also be the fastest. Self-training
algorithms see an application in multiple contexts, including NLP and computer vision; as we'll see,
they can present both substantial value and significant risks.


The objective of self-training is to combine information from unlabeled cases with that of labeled
cases to iteratively identify labels for the dataset's unlabeled examples. On each iteration, the labeled
training set is enlarged until the entire dataset is labeled.


The self-training algorithm is typically applied as a wrapper to a base model. In this chapter, we'll be 
using an SVM as the base for our self-training model. The self-training algorithm is quite simple and 
contains very few steps, as follows: 


1. A set of labeled data is used to predict labels for a set of unlabeled data. (This may be all
   unlabeled data or part of it.)
2. Confidence is calculated for all newly labeled cases.
3. Cases are selected from the newly labeled data to be kept for the next iteration.
4. The model trains on all labeled cases, including cases selected in previous iterations.
5. The model iterates through steps 1 to 4 until it successfully converges.
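The five steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the base model is a toy nearest-centroid scorer (not the SVM used later in the chapter), the data is synthetic, and the names centroid_proba, self_train, and threshold are hypothetical.

```python
import numpy as np

def centroid_proba(X_train, y_train, X):
    """Toy base model: softmax over negative distances to the two class centroids."""
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    scores = np.exp(-dists)
    return scores / scores.sum(axis=1, keepdims=True)

def self_train(X, y, threshold=0.6, max_iter=20):
    """y uses -1 for unlabeled cases; returns the labels after self-training."""
    y = y.copy()
    for _ in range(max_iter):
        labeled = y != -1
        if labeled.all():
            break
        # Step 1: predict labels for the unlabeled cases.
        proba = centroid_proba(X[labeled], y[labeled], X[~labeled])
        # Step 2: confidence is the probability of the predicted class.
        confidence = proba.max(axis=1)
        # Step 3: keep only the confident cases.
        keep = confidence > threshold
        if not keep.any():
            break  # nothing new to add, so the loop has converged
        # Step 4: enlarge the labeled set; the next pass retrains on it.
        idx = np.flatnonzero(~labeled)[keep]
        y[idx] = proba[keep].argmax(axis=1)
    return y

# Synthetic two-cluster data with one labeled case per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.full(40, -1)
y[0], y[20] = 0, 1
y_final = self_train(X, y)
```

Cases that never clear the threshold keep their -1 marker, which makes it easy to see how much of the dataset the loop was willing to label.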


Presented graphically, this process looks as follows:


[Figure: the self-training process]


Upon completing training, the self-trained model would be tested and validated. This may be done via
cross-validation or even using held-out, labeled data, should this exist.


Self-training provides real power and time saving, but is also a risky process. In order to understand
what to look out for and how to apply self-training to your own classification algorithms, let's look in
more detail at how the algorithm works.


To support this discussion, we're going to work with code from the semisup-learn GitHub repository.
In order to use this code, we'll need to clone the relevant GitHub repository. Instructions for this are
located in Appendix A.


Implementing self-training


The first step in each iteration of self-training is one in which class labels are generated for unlabeled
cases. This is achieved by first creating a SelfLearningModel class, which takes a base supervised
model (basemodel) and an iteration limit as arguments. As we'll see later in this chapter, an iteration
limit can be explicitly specified or provided as a function of classification accuracy (that is,
convergence). The prob_threshold parameter provides a minimum quality bar for label acceptance;
any projected label that scores at less than this level will be rejected. Again, we'll see in later
examples that there are alternatives to providing a hardcoded threshold value.


class SelfLearningModel(BaseEstimator):

    def __init__(self, basemodel, max_iter=200, prob_threshold=0.8):
        self.model = basemodel
        self.max_iter = max_iter
        self.prob_threshold = prob_threshold


Having defined the shell of the SelfLearningModel class, the next step is to define functions for the
process of semi-supervised model fitting:


    def fit(self, X, y):
        # Split the data into unlabeled (-1) and labeled cases
        unlabeledX = X[y == -1, :]
        labeledX = X[y != -1, :]
        labeledy = y[y != -1]

        # Train on the labeled cases, then make a first pass over the unlabeled ones
        self.model.fit(labeledX, labeledy)
        unlabeledy = self.predict(unlabeledX)
        unlabeledprob = self.predict_proba(unlabeledX)
        unlabeledy_old = []

        i = 0


The X parameter is a matrix of input data, whose shape is equivalent to [n_samples, n_features].
X is used to create a matrix of [n_samples, n_samples] size. The y parameter, meanwhile, is an
array of labels. Unlabeled points are marked as -1 in y. From X, unlabeledX and labeledX are
created quite simply by selecting the rows of X whose position corresponds to a -1 label in y (for
unlabeledX) or to any other label (for labeledX). The labeledy variable performs a similar selection over y.
(Naturally, we're not that interested in the unlabeled samples of y as a variable, but we need the 
labels that do exist for classification attempts!) 
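To make the masking concrete, here is a small, self-contained NumPy illustration. The four-row matrix is made up for the example; only the variable names follow the fit method.

```python
import numpy as np

X = np.array([[0.1, 1.0],
              [0.9, 0.2],
              [0.4, 0.5],
              [0.8, 0.1]])
y = np.array([0, -1, 1, -1])     # -1 marks an unlabeled case

unlabeledX = X[y == -1, :]       # rows 1 and 3 of X
labeledX = X[y != -1, :]         # rows 0 and 2 of X
labeledy = y[y != -1]            # the labels that do exist
```

Each boolean mask has one entry per row of X, so the row order of the selected cases is preserved.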


The actual process of label prediction is achieved using the base model's methods. The unlabeledy
variable is generated using sklearn's predict method, while the predict_proba method is used to
calculate probabilities for each projected label. These probabilities are stored in unlabeledprob.


Note


Scikit-learn's predict and predict_proba methods work to predict class labels and the
probability of class labeling being correct, respectively. As we'll be applying both of these methods
within several of our semi-supervised algorithms, it's informative to understand how they actually
work.


The predict method produces class predictions for input data. For scikit-learn's SVM classifiers, it
does so via a set of binary classifiers (that is, classifiers that attempt to differentiate only two classes).
A full model with n-many classes contains a set of binary classifiers as follows:

\[ \frac{n(n-1)}{2} \]


In order to make a prediction for a given case, all classifiers whose scores exceed zero vote for a
class label to apply to that case. The class with the most votes (and not, say, the highest sum classifier
score) is identified. This is referred to as a one-versus-one prediction method and is a fairly common
approach.
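The voting scheme can be illustrated without any library at all. In this sketch, the class names and pairwise scores are invented, and ovo_predict is a hypothetical helper rather than anything scikit-learn exposes; a positive score for the pair (a, b) counts as a vote for a, and anything else as a vote for b.

```python
from collections import Counter
from itertools import combinations

def ovo_predict(pair_scores, classes):
    """Return the class with the most pairwise votes, plus the vote counts."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[a if pair_scores[(a, b)] > 0 else b] += 1
    return votes.most_common(1)[0][0], votes

classes = ["cat", "dog", "bird"]
pair_scores = {("cat", "dog"): 1.3, ("cat", "bird"): 0.2, ("dog", "bird"): -2.5}

label, votes = ovo_predict(pair_scores, classes)
n_classifiers = len(list(combinations(classes, 2)))   # n * (n - 1) / 2 pairs
```

Note that "bird" wins its pair by the largest margin, yet "cat" is predicted because it collects two votes; the size of a score never enters the count.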


Meanwhile, for SVMs, predict_proba works by invoking Platt calibration, a technique that allows the
outputs of a classification model to be transformed into a probability distribution over the classes.
This involves first training the base model in question and then fitting a logistic regression model to
the classifier's scores f(X):


\[ P(y = 1 \mid X) = \frac{1}{1 + \exp(A \, f(X) + B)} \]


This model can then be optimized (through scalar parameters A and B) using a maximum likelihood
method. In the case of our self-training model, predict_proba allows us to fit a regression model to
the classifier's scores and thus calculate probabilities for each class label. This is extremely helpful! 
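As a rough illustration of the idea, the following sketch fits the two scalars A and B by plain gradient descent on the negative log-likelihood. It is not how scikit-learn implements calibration; the scores and labels are made up, and platt_fit and platt_proba are hypothetical names.

```python
import math

def platt_proba(f, A, B):
    """P(y = 1 | f) = 1 / (1 + exp(A * f + B))."""
    return 1.0 / (1.0 + math.exp(A * f + B))

def platt_fit(scores, labels, lr=0.1, epochs=2000):
    """Fit A and B by gradient descent on the negative log-likelihood."""
    A, B = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_A = grad_B = 0.0
        for f, y in zip(scores, labels):
            p = platt_proba(f, A, B)
            # For this parameterisation, d(NLL)/dA = (y - p) * f and d(NLL)/dB = (y - p).
            grad_A += (y - p) * f
            grad_B += (y - p)
        A -= lr * grad_A / n
        B -= lr * grad_B / n
    return A, B

# Toy classifier scores: larger scores should mean "more likely class 1".
scores = [-2.0, -1.0, 0.3, -0.2, 1.0, 2.0]
labels = [0, 0, 0, 1, 1, 1]
A, B = platt_fit(scores, labels)
```

Because higher scores go with class 1 here, the fitted A comes out negative, so the probability rises with the score.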


Next, we need a loop for iteration. The following code describes a while loop that executes until
the predicted labels stop changing between iterations (unlabeledy_old is a copy of unlabeledy from
the previous iteration) or until the maximum iteration count is reached. On each iteration, a labeling
attempt is made for each case that does not have a label whose probability exceeds the probability
threshold (prob_threshold):


        while (len(unlabeledy_old) == 0 or
               numpy.any(unlabeledy != unlabeledy_old)) and i < self.max_iter:
            unlabeledy_old = numpy.copy(unlabeledy)
            uidx = numpy.where((unlabeledprob[:, 0] > self.prob_threshold)
                               | (unlabeledprob[:, 1] > self.prob_threshold))[0]


The self.model.fit method then refits the model on the labeled data together with the unlabeled cases
whose predicted labels cleared the threshold. The training matrix and label vector are created by
appending (with vstack and hstack) those selected unlabeled cases to the labeled ones:


            self.model.fit(numpy.vstack((labeledX, unlabeledX[uidx, :])),
                           numpy.hstack((labeledy, unlabeledy_old[uidx])))


Finally, the iteration performs label predictions, followed by probability predictions for those labels. 


            unlabeledy = self.predict(unlabeledX)
            unlabeledprob = self.predict_proba(unlabeledX)
            i += 1


On the next iteration, the model will perform the same process, this time taking the newly labeled data
whose probability predictions exceeded the threshold as part of the dataset used in the model.fit
step.
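A single pass of this selection-and-stacking logic can be traced by hand on a tiny example. The arrays below are invented for illustration; the variable names mirror those in the fit method, and the threshold of 0.8 is just a value chosen for the example.

```python
import numpy as np

labeledX = np.array([[0.0, 0.0], [1.0, 1.0]])
labeledy = np.array([0, 1])
unlabeledX = np.array([[0.1, 0.2], [0.5, 0.5], [0.9, 0.8]])
unlabeledy = np.array([0, 1, 1])                  # predicted labels
unlabeledprob = np.array([[0.95, 0.05],
                          [0.45, 0.55],
                          [0.10, 0.90]])          # predicted class probabilities
prob_threshold = 0.8

# Indices of unlabeled cases whose predicted label is confident enough.
uidx = np.where((unlabeledprob[:, 0] > prob_threshold)
                | (unlabeledprob[:, 1] > prob_threshold))[0]

# Training data for the next fit: the labeled cases plus the confident ones.
X_train = np.vstack((labeledX, unlabeledX[uidx, :]))
y_train = np.hstack((labeledy, unlabeledy[uidx]))
```

The middle case, with probabilities of 0.45 and 0.55, is left out of this round; it gets another chance once the model has been refitted.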


If one's model does not already include a method that can generate probability estimates for its label
predictions (like the predict_proba method available in sklearn's SVM implementation), it is possible
to introduce one. The following code checks for the predict_proba method and introduces Platt
scaling of generated labels if this method is not found:


        if not getattr(self.model, "predict_proba", None):
            # No native probabilities: fit a Platt-scaling model on the base model's predictions
            self.plattlr = LR()
            preds = self.model.predict(labeledX)
            self.plattlr.fit(preds.reshape(-1, 1), labeledy)

        return self

    def predict_proba(self, X):
        if getattr(self.model, "predict_proba", None):
            return self.model.predict_proba(X)
        else:
            preds = self.model.predict(X)
            return self.plattlr.predict_proba(preds.reshape(-1, 1))


Once we have this much in place, we can begin applying our self-training architecture. To do so, let's
grab a dataset and start working!


For this example, we'll use a simple logistic regression classifier, with Stochastic Gradient Descent
(SGD) as our learning component, as our base model (basemodel). The input dataset will be the
statlog heart dataset, obtained from www.mldata.org. This dataset is provided in the GitHub
repository accompanying this chapter.


The heart dataset is a two-class dataset, where the classes are the absence or presence of a heart
disease. There are no missing values across the 270 cases for any of its 13 features. This data is
unlabeled and many of the variables needed are usually captured via expensive and sometimes
inconvenient tests. The variables are as follows:


1. age
2. sex
3. chest pain type (4 values)
4. resting blood pressure
5. serum cholesterol in mg/dl
6. fasting blood sugar > 120 mg/dl
7. resting electrocardiographic results (values 0, 1, 2)
8. maximum heart rate achieved
9. exercise induced angina
10. oldpeak = ST depression induced by exercise relative to rest
11. the slope of the peak exercise ST segment
12. number of major vessels (0-3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect


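Let's get started with the Heart dataset by loading in the data, then fitting a model to it:


import random
import numpy as np
from sklearn.datasets import fetch_mldata
from sklearn.linear_model import SGDClassifier
from frameworks.SelfLearning import SelfLearningModel  # from the semisup-learn repository

heart = fetch_mldata("heart")
X = heart.data
ytrue = np.copy(heart.target)
ytrue[ytrue==-1]=0

# Keep one labeled case per class; everything else is treated as unlabeled
labeled_N = 2
ys = np.array([-1]*len(ytrue)) # -1 denotes unlabeled point
random_labeled_points = random.sample(np.where(ytrue == 0)[0], labeled_N/2)+\
                        random.sample(np.where(ytrue == 1)[0], labeled_N/2)
ys[random_labeled_points] = ytrue[random_labeled_points]

# Supervised baseline: logistic regression trained with SGD on the labeled cases only
basemodel = SGDClassifier(loss='log', penalty='l1')
basemodel.fit(X[random_labeled_points, :], ys[random_labeled_points])
print "Supervised log.reg. score", basemodel.score(X, ytrue)

# Self-training wrapper around the same base model
ssmodel = SelfLearningModel(basemodel)
ssmodel.fit(X, ys)
print "self-learning log.reg. score", ssmodel.score(X, ytrue)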


Attempting this yields moderate, but not excellent, results: 


self-learning log.reg. score 0.470347 


However, over 1,000 trials, we find that the quality of our outputs varies considerably:


[Figure: Boxplot of Self-Training Model Score Across 30 Trials]





Given that we're looking at classification accuracy scores for sets of real-world and unlabeled data, 
this isn't a terrible result, but I don't think we should be satisfied with it. We're still labeling more than 
half of our cases incorrectly! 


We need to understand the problem a little better; right now, it isn't clear what's going wrong or how 
we can improve on our results. Let's figure this out by returning to the theory around self-training to 
understand how we can diagnose and improve our implementation. 


Finessing your self-training implementation


In the previous section, we discussed the creation of self-training algorithms and tried out an 
implementation. However, what we saw during our first trial was that our results, while 
demonstrating the potential of self-training, left room for growth. Both the accuracy and variance of 
our results were questionable. 


Self-training can be a fragile process. If an element of the algorithm is ill-configured or the input data
contains peculiarities, it is very likely that the iterative process will fail once and continue to
compound that error by reintroducing incorrectly labeled data to future labeling steps. As the
self-training algorithm iteratively feeds itself, garbage in, garbage out is a very real concern.


There are several quite common flavors of risk that should be called out. In some cases, newly labeled
data may not add more useful information. This is particularly common in the first few iterations, and
understandably so! In general, unlabeled cases that are most easily labeled are the ones that are most 
similar to existing labeled cases. However, while it's easy to generate high-probability labels for 
these cases, there's no guarantee that their addition to the labeled set will make it easier to label 
during subsequent iterations. 


Unfortunately, this can sometimes lead to a situation in which cases are being added that have no real
effect on classification while classification accuracy in general deteriorates. Even worse, adding
cases that are similar to pre-existing cases in enough respects to make them easy to label, but that
actually misguide the classifier's decision boundary, can increase misclassification.


Diagnosing what went wrong with a self-training model can sometimes be difficult, but as always, a 
few well-chosen plots add a lot of clarity to the situation. As this type of error occurs particularly 
often within the first few iterations, simply adding an element to the label prediction loop that writes 
the current classification accuracy allows us to understand how accuracy trended during early 
iterations. 
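A minimal way to do this is to record accuracy after every iteration and flag the first drop. The sketch below assumes that a few held-out labeled cases are available to score against; the predictions are invented, and accuracy and first_drop are hypothetical helper names.

```python
def accuracy(y_pred, y_true):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)

def first_drop(history):
    """Index of the first iteration whose accuracy is below the previous one, or None."""
    for i in range(1, len(history)):
        if history[i] < history[i - 1]:
            return i
    return None

y_true = [0, 0, 1, 1, 1]
predictions_per_iteration = [      # one list of predictions per self-training iteration
    [0, 0, 1, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]

history = [accuracy(p, y_true) for p in predictions_per_iteration]
drop_at = first_drop(history)
```

Here accuracy rises from 0.6 to 0.8 and then falls to 0.4, so the third iteration (index 2) is where we'd start looking for badly labeled additions.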


Once the issue has been identified, there are a few possible solutions. If enough labeled data exists, a 
simple solution is to attempt to use a more diverse set of labeled data to kick-start the process. 


While the impulse might be to use all of the labeled data, we'll see later in this chapter that
self-training models are vulnerable to overfitting, a risk that forces us to hold on to some data for
validation purposes. A promising option is to use multiple subsets of our dataset to train multiple
self-training model instances. Doing so, particularly over several trials, can help us understand the
impact of our input data on our self-training model's performance.


In Chapter 8, Ensemble Methods, we'll explore some options around ensembles that will enable us to 
use multiple self-training models together to yield predictions. When ensembling is accessible to us,
we can even consider applying multiple sampling techniques in parallel. 


If we don't want to solve this problem with quantity, though, perhaps we can solve it by improving 
quality. One solution is to create an appropriately diverse subset of the labeled data through selection.
There isn't a hard limit on the number of labeled cases that works well as a minimum amount to start
up a self-training implementation. While you could hypothetically start working with even one
labeled case per class (as we did in our preceding training example), it'll quickly become obvious 
that training against a more diverse and overlapping set of classes benefits from more labeled data. 


Another class of error that a self-training model is particularly vulnerable to is biased selection. Our
naive assumption is that the selection of data during each iteration is, at worst, only slightly biased
(favoring one class only slightly more than others). The reality is that this is not a safe assumption. 
There are several factors that can influence the likelihood of biased selection, with the most likely 
culprit being disproportionate sampling from one class. 


If the dataset as a whole, or the labeled subsets used, are biased toward one class, then the risk 
increases that your self-training classifier will overfit. This only compounds the problem as the cases 
provided for the next iteration are liable to be insufficiently diverse to solve the problem; whatever 
incorrect decision boundary was set up by the self-training algorithm will stay where it is, overfit
to a subset of the data. Numerical disparity between each class's count of cases is the main symptom
here, but the more usual methods to spot overfitting can also be helpful in diagnosing problems 
around selection bias. 
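Checking for this kind of numerical disparity takes only a few lines. In this sketch, the label list is invented, class_shares and is_imbalanced are hypothetical helpers, and the 0.75 cut-off is purely illustrative rather than a recommended value.

```python
from collections import Counter

def class_shares(labels, unlabeled=-1):
    """Share of each class among the cases that currently carry a label."""
    counts = Counter(l for l in labels if l != unlabeled)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def is_imbalanced(labels, max_share=0.75):
    """True when a single class holds more than max_share of the labeled cases."""
    return max(class_shares(labels).values()) > max_share

labels_after_iteration = [0, 0, 0, 0, 0, 0, 0, 1, -1, -1]
shares = class_shares(labels_after_iteration)
```

Running a check like this after each iteration shows whether the labeled set is drifting toward one class as new cases are added.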


Note 


This reference to the usual methods of spotting overfitting is worth expanding on because techniques
to identify overfitting are highly valuable! These techniques are typically referred to as validation
techniques. The fundamental concept underpinning validation techniques is that one has two sets of
data: one that is used to build a model, and another that is used to test it.


The most effective validation technique is independent validation, the simplest form of which 
involves waiting to determine whether predictions are accurate. This obviously isn't always (or even, 
often) possible! 


Given that it may not be possible to perform independent validation, the best bet is to hold out a 
subset of your sample. This is referred to as sample splitting and is the foundation of modern
validation techniques. Most machine learning implementations refer to training, test, and validation 
datasets; this is a case of multilayered validation in action. 


A third and critical validation tool is resampling, where subsets of the data are iteratively used to
repeatedly validate the dataset. In Chapter 1, Unsupervised Machine Learning, we saw the use of
v-fold cross-validation; cross-validation techniques are perhaps the best examples of resampling in
action. 
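To show what resampling looks like mechanically, here is a small standard-library sketch of v-fold splitting. The function name k_fold_indices and the choice of ten samples and five folds are assumptions made for the example; in practice, scikit-learn's own cross-validation utilities would normally be used instead.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists; each sample is used for testing exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(10, 5))
```

Each of the five splits holds out two samples for testing and trains on the remaining eight.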


Beyond applicable techniques, it's a good idea to be mindful of the sample size required for
the effective modeling of your data. There are no universal principles here, but I always rather liked
the following rule of thumb:


If m points are required to determine a univariate regression line with sufficient precision, then it will
take at least $m^n$ observations and perhaps $n! \, m^n$ observations to appropriately characterize and
evaluate a regression model with n variables.
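A quick calculation shows how fast these numbers grow. The sketch reads the rule as a lower bound of m to the power n and an upper figure of n factorial times m to the power n; that reading of the exponents is an assumption, and observations_needed is a hypothetical helper.

```python
from math import factorial

def observations_needed(m, n):
    """Return (m**n, n! * m**n) for m points per variable and n variables."""
    lower = m ** n
    upper = factorial(n) * m ** n
    return lower, upper

lower, upper = observations_needed(m=10, n=3)
```

With 10 points per variable and only 3 variables, this already suggests between 1,000 and 6,000 observations, far more than the 270 cases in the heart dataset.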


Note that there is some tension between the suggested solutions to this problem (resampling, sample
splitting, and validation techniques including cross-validation) and the preceding one. Namely,
overfitting requires a more restrained use of subsets of the labeled training data, while bad starts are
less likely to occur using more training data. For each specific problem, depending on the complexity
of the data under analysis, there will be an appropriate balance to strike. By monitoring for signs of
either type of problem, the appropriate action (whether that is an increase or decrease in the amount
of labeled data used simultaneously in an iteration) can be taken at the right time.


A further class of risk introduced by self-training is that the introduction of unlabeled data almost
always introduces noise. If dealing with datasets where part or all of the unlabeled cases are highly 
noisy, the amount of noise introduced may be sufficient to degrade classification accuracy. 


Note 


The idea of using data complexity and noise measures to understand the degree of noise in one's 
dataset is not new. Fortunately for us, quite a lot of good estimators already exist that we can take
advantage of. 


There are two main groups of relative complexity measures. Some attempt to measure the overlap of
values of different classes, or separability; measures in this group attempt to describe the degree of
ambiguity of each class relative to the other classes. One good measure for such cases is the
maximum Fisher's discriminant ratio, though maximum individual feature efficiency is also effective.
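For two classes, one common form of the maximum Fisher's discriminant ratio takes, for each feature, the squared difference of the class means divided by the sum of the class variances, and then keeps the largest value. The sketch below uses that definition on made-up data; max_fisher_ratio is a hypothetical helper, and other variants of the measure exist.

```python
import numpy as np

def max_fisher_ratio(X, y):
    """Maximum over features of (mean0 - mean1)^2 / (var0 + var1) for classes 0 and 1."""
    X0, X1 = X[y == 0], X[y == 1]
    numerator = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    denominator = X0.var(axis=0) + X1.var(axis=0)
    return float(np.max(numerator / denominator))

# Feature 0 separates the classes cleanly; feature 1 does not separate them at all.
X = np.array([[0.0, 5.0], [1.0, 6.0], [0.0, 6.0], [1.0, 5.0],
              [10.0, 5.0], [11.0, 6.0], [10.0, 6.0], [11.0, 5.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

ratio = max_fisher_ratio(X, y)
```

A large ratio means that at least one feature separates the classes well; a value near zero suggests heavy overlap on every individual feature.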


Alternatively (and sometimes more simply), one can use the error function of a linear classifier to 
understand how separable the dataset's classes are from one another. By attempting to train a simple 
linear classifier on your dataset and observing the training error, one can immediately get a good 
understanding as to how linearly separable the classes are. Furthermore, measures related to this 
classifier (such as the fraction of points in the class boundary or the ratio of average intra/inter class
nearest neighbor distance) can also be extremely helpful. 


There are other data complexity measures that specifically measure the density or geometry of the 
dataset. One good example is the fraction of maximum covering spheres. Again, helpful measures can
be accessed by applying a linear classifier and including the nonlinearity of that classifier. 


Improving the selection process 


The key to the self-training algorithm working correctly is the accurate calculation of confidence for
each label projection.


During our first explanation of self-training, we used some simplistic values for certain parameters, 
including a parameter closely tied to confidence calculation. In selecting our labeled cases, we used a 
fixed confidence level for comparison against predicted probabilities, where we could've adopted 
any one of several different strategies: 


• Adding all of the projected labels to the set of labeled data
• Using a confidence threshold to add only the most confident labels to the set
• Adding all the projected labels to the labeled dataset and weighting each label by confidence
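The three strategies in the list above differ only in what they do with the predicted probabilities. Here is a short NumPy sketch; the probability matrix is invented and the threshold of 0.75 is an illustrative value.

```python
import numpy as np

proba = np.array([[0.95, 0.05],
                  [0.55, 0.45],
                  [0.20, 0.80],
                  [0.51, 0.49]])
labels = proba.argmax(axis=1)        # projected label for each unlabeled case
confidence = proba.max(axis=1)       # probability of that projected label

# Strategy 1: accept every projected label.
accept_all = np.arange(len(labels))

# Strategy 2: accept only the labels whose confidence clears a threshold.
threshold = 0.75
accept_confident = np.flatnonzero(confidence > threshold)

# Strategy 3: accept every label, but weight each case by its confidence.
sample_weight = confidence
```

For the third strategy, many scikit-learn estimators accept a sample_weight argument in fit, which is one way to pass these weights on to the base model.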


All in all, we've seen that self-training implementations present quite a lot of risk. They're prone to a 
number of training failures and are also subject to overfitting. To make matters worse, as the amount 
of unlabeled data increases, the accuracy of a self-training classifier becomes increasingly at risk. 


Our next step will be to look at a very different self-training implementation. While conceptually 
similar to the algorithm that we worked with earlier in this chapter, the next technique we'll be 
looking at operates under different assumptions to yield very different results.


Contrastive Pessimistic Likelihood Estimation


In our preceding discovery and application of self-training techniques, we found self-training to be a 
powerful technique with significant risks. Particularly, we found a need for multiple diagnostic tools 
and some quite restrictive dataset conditions. While we can work around these problems by 
subsetting, identifying optimal labeled data, and attentively tracking performance for some datasets, 
some of these actions continue to be impossible for the very data that self-training would bring the 
most benefit to—data where labeling requires expensive tests, be those medical or scientific, with 
specialist knowledge and equipment. 


In some cases, we end up with some self-training classifiers that are outperformed by their
supervised counterparts, which is a pretty terrible state of affairs. Even worse, while a supervised
classifier with labeled data will tend to improve in accuracy with additional cases, semi-supervised
classifier performance can degrade as the dataset size increases. What we need, then, is a less naive 
approach to semi-supervised learning. Our goal should be to find an approach that harnesses the 
benefits of semi-supervised learning while maintaining performance at least comparable with that of 
the same classifier under a supervised approach. 


A very recent (May 2015) approach to semi-supervised learning, CPLE, provides a more general way
to perform semi-supervised parameter estimation. CPLE provides a rather remarkable advantage: it 
produces label predictions that have been demonstrated to consistently outperform those created by 
equivalent semi-supervised classifiers or by supervised classifiers working from the labeled data! In 
other words, when performing a linear discriminant analysis, for instance, it is advised that you 
perform a CPLE-based, semi-supervised analysis instead of a supervised one, as you will always 
obtain at least equivalent performance. 


This is a pretty big claim and it needs substantiating. Let's start by building an understanding of how 
CPLE works before moving on to demonstrate its superior performance in real cases. 


CPLE uses the familiar measure of maximized log-likelihood for parameter optimization. This can be 
thought of as the success condition; the model we'll develop is intended to optimize the maximized
log-likelihood of our model's parameters. It is the specific guarantees and assumptions that CPLE 
incorporates that make the technique effective. 


In order to create a better semi-supervised learner, one that improves on its supervised alternative,
CPLE takes the supervised estimates into account explicitly, using the loss incurred between the
semi-supervised and supervised models as a training performance measure:


[Figure: CPLE compared with supervised and semi-supervised models. When the semi-supervised
model correctly classifies cases that the supervised model doesn't, CPLE uses the semi-supervised
results to gain an edge over supervised approaches. When the supervised model correctly classifies
cases that the semi-supervised model doesn't, CPLE falls back on the more reliable results from the
supervised model, guaranteeing parity with a supervised approach.]


CPLE calculates the relative improvement of any semi-supervised estimate over the supervised 
solution. Where the supervised solution outperforms the semi-supervised estimate, the loss function 
shows this and the model can train to adjust the semi-supervised model to reduce this loss. Where the 
semi-supervised solution outperforms the supervised solution, the model can learn from the
semi-supervised model by adjusting model parameters.


However, while this sounds excellent so far, there is a flaw in the theory that has to be addressed. The 
fact that data labels don't exist for a semi-supervised solution means that the posterior distribution 
(that CPLE would use to calculate loss) is inaccessible. CPLE's solution to this is to be pessimistic. 
The CPLE algorithm takes the Cartesian product of all label/prediction combinations and then 
selects the posterior distribution that minimizes the gain in likelihood. 
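
The pessimistic step can be written compactly. What follows is a sketch in the spirit of Loog's paper (listed under Further reading), not a quotation from it. The symbols are assumptions introduced here for illustration: X is the labeled set, U the unlabeled set, q a soft labeling of U, L a log-likelihood, and theta_sup the supervised estimate.

```latex
\begin{aligned}
CL(\theta, \theta_{\mathrm{sup}}, q \mid X, U) &= L(\theta \mid X, U, q) - L(\theta_{\mathrm{sup}} \mid X, U, q) \\
CPL(\theta, \theta_{\mathrm{sup}} \mid X, U) &= \min_{q} \; CL(\theta, \theta_{\mathrm{sup}}, q \mid X, U) \\
\hat{\theta}_{\mathrm{semi}} &= \arg\max_{\theta} \; CPL(\theta, \theta_{\mathrm{sup}} \mid X, U)
\end{aligned}
```

Choosing theta equal to theta_sup makes the contrast zero for every q, so the maximized pessimistic contrast can never be negative. Under this formulation, that is where the parity guarantee on the training likelihood comes from.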


In real-world machine learning contexts, this is a very safe approach. It delivers the classification 
accuracy of a supervised approach with semi-supervised performance improvement derived via 
conservative assumptions. In real applications, these conservative assumptions enable high 
performance under testing. Even better, CPLE can deliver particular performance improvements on 
some of the most challenging semi-supervised learning cases, where the labeled data is a poor 
representation of the unlabeled data (by virtue of poor sampling from one or more classes or just 
because of a shortage of unlabeled cases). 


In order to understand how much more effective CPLE can be than semi-supervised or supervised 
approaches, let's apply the technique to a practical problem. We'll once again work with the 
semisup-learn library, a specialist Python library focused on semi-supervised learning, which extends 
scikit-learn to provide CPLE across any scikit-learn-provided classifier. We begin with a CPLE class: 


class CPLELearningModel(BaseEstimator):

    def __init__(self, basemodel, pessimistic=True, predict_from_probabilities=False,
                 use_sample_weighting=True, max_iter=3000, verbose=1):
        self.model = basemodel
        self.pessimistic = pessimistic
        self.predict_from_probabilities = predict_from_probabilities
        self.use_sample_weighting = use_sample_weighting
        self.max_iter = max_iter
        self.verbose = verbose


We're already familiar with the concept of basemodel. Earlier in this chapter, we employed S3VMs 
and semi-supervised LDEs. In this situation, we'll again use an LDE; the goal of this first assay will 
be to try and exceed the results obtained by the semi-supervised LDE from earlier in this chapter. In 
fact, we're going to blow those results out of the water! 


Before we do so, however, let's review the other parameter options. The pessimistic argument 
gives us an opportunity to use a non-pessimistic (optimistic) model. Instead of following the 
pessimistic method of minimizing the loss between unlabeled and labeled discriminative 
likelihood, an optimistic model aims to maximize likelihood. This can yield better results (mostly 
during training), but is significantly more risky. Here, we'll be working with pessimistic models. 


The predict_from_probabilities parameter enables optimization by allowing a prediction to be 
generated from the probabilities of multiple data points at once. If we set this as true, our CPLE will 
set the prediction as 1 if the probability we're using for prediction is greater than the mean, or 0 
otherwise. The alternative is to use the base model probabilities, which is generally preferable for 
performance reasons, unless we'll be calling predict across a number of cases. 
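
The mean-threshold rule described above can be illustrated in isolation. This is a minimal sketch, not the library's own method: the function name predict_from_mean and the sample probabilities are hypothetical, and it assumes a one-dimensional array of class-1 probabilities for a batch of cases.

```python
import numpy as np

def predict_from_mean(positive_proba):
    """Return 1 where a probability exceeds the batch mean, else 0."""
    positive_proba = np.asarray(positive_proba, dtype=float)
    return (positive_proba > positive_proba.mean()).astype(int)

# Hypothetical class-1 probabilities for three cases (mean is 0.4).
batch = [0.10, 0.20, 0.90]
predictions = predict_from_mean(batch)
```

Because the threshold is the batch mean, the same case can receive a different label depending on which other cases are scored with it, which is why this mode only makes sense when predicting many cases at once.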


We also have the option to set use_sample_weighting, otherwise known as soft labels (but most 
familiar to us as posterior probabilities). We would normally take this opportunity, as soft labels 
enable greater flexibility than hard labels and are generally preferred (unless the model only supports 
hard class labels). 


The next few attributes provide a means of stopping CPLE training, either at maximum iterations or 
after log-likelihood stops improving (typically because of convergence). The bestdl attribute holds the 
best discriminative likelihood value and bestlbls the corresponding soft labels; these values are 
updated on each training iteration: 


        self.it = 0
        self.noimprovementsince = 0
        self.maxnoimprovementsince = 3

        self.buffersize = 200
        self.lastdls = [0]*self.buffersize

        self.bestdl = numpy.infty
        self.bestlbls = []

        self.id = str(unichr(numpy.random.randint(26)+97)) + \
                  str(unichr(numpy.random.randint(26)+97))


The discriminative_likelihood function calculates the likelihood of an input for discriminative 
models, that is, models that aim to maximize the probability of a target, y = 1, conditional on the 
input, X. 


Note 


In this case, it's worth drawing your attention to the distinction between generative and discriminative 
models. While this isn't a basic concept, it can be fundamental in understanding why many classifiers 
have the goals that they do. 


A classification model takes input data and attempts to classify cases, assigning each case a label. 
There is more than one way to do this. 


One approach is to take the cases and attempt to draw a decision boundary between them. Then we 
can take each new case as it appears and identify which side of the boundary it falls on. This is a 
discriminative learning approach. 


Another approach is to attempt to model the distribution of each class individually. Once a model has 
been generated, the algorithm can use Bayes' rule to calculate the posterior distribution on the labels 
given input data. This approach is generative and is a very powerful approach with significant 
weaknesses (most of which tie into the question of how well we can model our classes). Generative 
approaches include Gaussian discriminant models (yes, that is a slightly confusing name) and a broad 
range of Bayesian models. More information, including some excellent recommended reading, is 
provided in the Further reading section of this chapter. 
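
To make the generative route concrete, here is a minimal sketch, assuming one-dimensional inputs and a single Gaussian per class. The function names and the toy data are hypothetical and are not part of semisup-learn. Each class distribution is modeled separately and Bayes' rule then turns those models into a posterior over labels.

```python
import numpy as np

def fit_gaussian_classes(x, y):
    """Generative step: estimate a prior, mean, and variance per class."""
    params = {}
    for label in np.unique(y):
        xs = x[y == label]
        params[label] = (len(xs) / len(x), xs.mean(), xs.var() + 1e-9)
    return params

def posterior(x_new, params):
    """Bayes' rule: p(y | x) is proportional to p(x | y) * p(y)."""
    labels = sorted(params)
    joint = np.array([
        prior * np.exp(-(x_new - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        for prior, mean, var in (params[label] for label in labels)
    ])
    return labels, joint / joint.sum(axis=0)

# Hypothetical toy data: class 0 clusters near -2, class 1 near +2.
x = np.array([-2.5, -2.0, -1.5, 1.5, 2.0, 2.5])
y = np.array([0, 0, 0, 1, 1, 1])

params = fit_gaussian_classes(x, y)
labels, post = posterior(np.array([-1.0, 1.0]), params)
predicted = [labels[i] for i in post.argmax(axis=0)]
```

A discriminative model would skip fit_gaussian_classes entirely and fit the boundary between the classes directly. The weakness mentioned above shows up here as the Gaussian assumption: if a class is not well described by one Gaussian, the posterior is unreliable.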


In this case, the function will be used on each iteration to calculate the likelihood of the predicted 
labels: 


    def discriminative_likelihood(self, model, labeledData, labeledy = None,
                                  unlabeledData = None, unlabeledWeights = None,
                                  unlabeledlambda = 1, gradient=[], alpha = 0.01):

        unlabeledy = (unlabeledWeights[:, 0]<0.5)*1
        uweights = numpy.copy(unlabeledWeights[:, 0])

        uweights[unlabeledy==1] = 1-uweights[unlabeledy==1]
        weights = numpy.hstack((numpy.ones(len(labeledy)), uweights))
        labels = numpy.hstack((labeledy, unlabeledy))
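
The conversion from soft labels to hard labels and sample weights can be tried on its own. The following sketch mirrors the three array operations in the listing; the function name harden_soft_labels and the sample values are hypothetical.

```python
import numpy as np

def harden_soft_labels(unlabeled_weights):
    """Split soft labels into hard labels plus per-sample confidence weights."""
    first_column = unlabeled_weights[:, 0]
    # A case is assigned label 1 when column 0 falls below 0.5.
    hard_labels = (first_column < 0.5) * 1
    confidence = np.copy(first_column)
    # For label-1 cases the confidence is the complement of column 0.
    confidence[hard_labels == 1] = 1 - confidence[hard_labels == 1]
    return hard_labels, confidence

# Hypothetical soft labels for three unlabeled cases.
soft = np.array([0.9, 0.2, 0.5])
weights_matrix = np.vstack((soft, 1 - soft)).T
hard_labels, confidence = harden_soft_labels(weights_matrix)
```

The resulting confidence is always at least 0.5, so a case whose soft label sits near 0.5 contributes least to the weighted fit.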


Having defined this much of our CPLE, we also need to define the fitting process for our supervised 
model. This uses familiar components, namely, model.fit and model.predict_proba, for 
probability prediction: 


        if self.use_sample_weighting:
            model.fit(numpy.vstack((labeledData, unlabeledData)), labels,
                      sample_weight=weights)
        else:
            model.fit(numpy.vstack((labeledData, unlabeledData)), labels)

        P = model.predict_proba(labeledData)


In order to perform pessimistic CPLE, we need to derive both the labeled and unlabeled 
discriminative log likelihood. In order, we then perform predict_proba on both the labeled and 
unlabeled data: 


        try:
            labeledDL = -sklearn.metrics.log_loss(labeledy, P)
        except Exception, e:
            print e
            P = model.predict_proba(labeledData)

        unlabeledP = model.predict_proba(unlabeledData)

        try:
            eps = 1e-15
            unlabeledP = numpy.clip(unlabeledP, eps, 1 - eps)
            unlabeledDL = numpy.average((unlabeledWeights*numpy.vstack((1-unlabeledy,
                unlabeledy)).T*numpy.log(unlabeledP)).sum(axis=1))
        except Exception, e:
            print e
            unlabeledP = model.predict_proba(unlabeledData)
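
The unlabeled term is the densest line in the listing, so a small worked example may help. This sketch repeats the same array arithmetic outside the class; the function name unlabeled_log_likelihood and the input values are hypothetical.

```python
import numpy as np

def unlabeled_log_likelihood(unlabeled_weights, unlabeled_proba, eps=1e-15):
    """Average soft-label-weighted log probability of the implied hard labels."""
    hard = (unlabeled_weights[:, 0] < 0.5) * 1
    # One-hot rows: [1, 0] for hard label 0 and [0, 1] for hard label 1.
    one_hot = np.vstack((1 - hard, hard)).T
    # Clipping keeps log() finite when a probability is exactly 0 or 1.
    clipped = np.clip(unlabeled_proba, eps, 1 - eps)
    return np.average((unlabeled_weights * one_hot * np.log(clipped)).sum(axis=1))

# Two hypothetical unlabeled cases scored by a maximally unsure model.
weights = np.array([[0.9, 0.1], [0.2, 0.8]])
proba = np.array([[0.5, 0.5], [0.5, 0.5]])
unlabeled_dl = unlabeled_log_likelihood(weights, proba)
```

With these inputs the rows contribute 0.9 * log(0.5) and 0.8 * log(0.5), so the result is 0.85 * log(0.5). A more confident and correct model would push the value toward zero.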


Once we're able to calculate the discriminative log likelihood for both the labeled and unlabeled 
classification attempts, we can set an objective via the discriminative_likelihood_objective 
function. The goal here is to use the pessimistic (or optimistic, by choice) methodology to calculate 
dl on each iteration until the model converges or the maximum iteration count is hit. 


On each iteration, a t-test is performed to determine whether the likelihoods have changed. 
Likelihoods should continue to change on each iteration preconvergence. Sharp-eyed readers may 
have noticed earlier in the chapter that three consecutive t-tests showing no change will cause the 
iteration to stop (this is configurable via the maxnoimprovementsince parameter): 


        if self.pessimistic:
            dl = unlabeledlambda * unlabeledDL - labeledDL
        else:
            dl = - unlabeledlambda * unlabeledDL - labeledDL

        return dl


    def discriminative_likelihood_objective(self, model, labeledData, labeledy = None,
                                            unlabeledData = None, unlabeledWeights = None,
                                            unlabeledlambda = 1, gradient=[], alpha = 0.01):

        if self.it == 0:
            self.lastdls = [0]*self.buffersize

        dl = self.discriminative_likelihood(model, labeledData, labeledy,
                                            unlabeledData, unlabeledWeights,
                                            unlabeledlambda, gradient, alpha)

        self.it += 1
        self.lastdls[numpy.mod(self.it, len(self.lastdls))] = dl

        if numpy.mod(self.it, self.buffersize) == 0: # or True:
            improvement = numpy.mean((self.lastdls[(len(self.lastdls)/2):])) - \
                          numpy.mean((self.lastdls[:(len(self.lastdls)/2)]))

            _, prob = scipy.stats.ttest_ind(self.lastdls[(len(self.lastdls)/2):],
                                            self.lastdls[:(len(self.lastdls)/2)])

            noimprovement = prob > 0.1 and \
                numpy.mean(self.lastdls[(len(self.lastdls)/2):]) < \
                numpy.mean(self.lastdls[:(len(self.lastdls)/2)])
            if noimprovement:
                self.noimprovementsince += 1
                if self.noimprovementsince >= self.maxnoimprovementsince:
                    self.noimprovementsince = 0
                    raise Exception(" converged.")
            else:
                self.noimprovementsince = 0


On each iteration, the algorithm saves the best discriminative likelihood and the best weight set for 
use in the next iteration: 


        if dl < self.bestdl:
            self.bestdl = dl
            self.bestlbls = numpy.copy(unlabeledWeights[:, 0])

        return dl


One more element worth discussing is how the soft labels are created. We've discussed these earlier 
in the chapter. This is how they look in code: 


        f = lambda softlabels, grad=[]: self.discriminative_likelihood_objective(
            self.model, labeledX, labeledy=labeledy, unlabeledData=unlabeledX,
            unlabeledWeights=numpy.vstack((softlabels, 1-softlabels)).T, gradient=grad)

        lblinit = numpy.random.random(len(unlabeledy))


In a nutshell, soft labels provide a probabilistic version of the discriminative likelihood 
calculation. In other words, they act as weights rather than hard, binary class labels. Soft labels are 
calculable using the optimize method: 


        try:
            self.it = 0
            opt = nlopt.opt(nlopt.GN_DIRECT_L_RAND, M)
            opt.set_lower_bounds(numpy.zeros(M))
            opt.set_upper_bounds(numpy.ones(M))
            opt.set_min_objective(f)
            opt.set_maxeval(self.max_iter)
            self.bestsoftlbl = opt.optimize(lblinit)
            print " max_iter exceeded."
        except Exception, e:
            print e
            self.bestsoftlbl = self.bestlbls

        if numpy.any(self.bestsoftlbl != self.bestlbls):
            self.bestsoftlbl = self.bestlbls
        ll = f(self.bestsoftlbl)

        unlabeledy = (self.bestsoftlbl<0.5)*1
        uweights = numpy.copy(self.bestsoftlbl)

        uweights[unlabeledy==1] = 1-uweights[unlabeledy==1]

        weights = numpy.hstack((numpy.ones(len(labeledy)), uweights))
        labels = numpy.hstack((labeledy, unlabeledy))


Note 


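For interested readers, optimize runs whichever algorithm was chosen when the optimizer was 
constructed. In the listing above that is nlopt's GN_DIRECT_L_RAND, a derivative-free global search 
from the DIRECT family, rather than a gradient-based method. Background on gradient-based alternatives 
such as conjugate gradient is provided in the Further reading section at the end of this chapter. 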


Once we understand how this works, the rest of the calculation is a straightforward comparison of the 
best supervised labels and soft labels, setting the bestsoftlbl attribute as the best label set. 
Following this, the discriminative likelihood is computed against the best label set and a fit function 
is calculated: 


        if self.use_sample_weighting:
            self.model.fit(numpy.vstack((labeledX, unlabeledX)), labels,
                           sample_weight=weights)
        else:
            self.model.fit(numpy.vstack((labeledX, unlabeledX)), labels)

        if self.verbose > 1:
            print "number of non-one soft labels: ", numpy.sum(self.bestsoftlbl != 1), \
                ", balance:", numpy.sum(self.bestsoftlbl<0.5), " / ", len(self.bestsoftlbl)
            print "current likelihood: ", ll


Now that we've had a chance to understand the implementation of CPLE, let's get hands-on with an 
interesting dataset of our own! This time, we'll change things up by working with Columbia 
University's Million Song Dataset. 


The central feature of this dataset is feature analysis and metadata for one million songs. The data 
is preprepared and made up of natural and derived features. Available features include things such as 
the artist's name and ID, duration, loudness, time signature, and tempo of each song, as well as other 
measures including a crowd-rated danceability score and tags associated with the audio. 


This dataset is generally labeled (via tags), but our objective in this case will be to generate genre 
labels for different songs based on the data provided. As the full million song dataset is a rather 
forbidding 300 GB, let's work with a 1% (1.8 GB) subset of 10,000 records. Furthermore, we don't 
particularly need this data as it currently exists; it's in an unhelpful format and a lot of the fields are 
going to be of little use to us. 


The 10000 songs dataset residing in the Chapter 6, Text Feature Engineering folder of our 
Mastering Python Machine Learning repository is a cleaned, prepared (and also rather large) subset 
of music data from multiple genres. In this analysis, we'll be attempting to predict genre from the 
genre tags provided as targets. We'll take a subset of tags as the labeled data used to kick-start our 
learning and will attempt to generate tags for unlabeled data. 


In this iteration, we're going to raise our game as follows: 


• Using more labeled data. This time, we'll use 1% of the total dataset size (100 songs), taken at 
  random, as labeled data. 

• Using an SVM with a linear kernel as our classifier, rather than the simple linear discriminant 
  analysis we used with our naive self-training implementation earlier in this chapter. 


So, let's get started: 


import sklearn.svm
import numpy as np
import random

from sklearn.datasets import fetch_mldata
from sklearn.linear_model import SGDClassifier

from frameworks.CPLELearning import CPLELearningModel
from frameworks.SelfLearning import SelfLearningModel
from methods import scikitTSVM
from examples.plotutils import evaluate_and_plot

kernel = "linear"

songs = fetch_mldata("10000_songs")
X = songs.data
ytrue = np.copy(songs.target)

labeled_N = 20

# -1 marks a point as unlabeled
ys = np.array([-1]*len(ytrue))

# draw an equal number of labeled points from each class
random_labeled_points = random.sample(np.where(ytrue == 0)[0], labeled_N/2)+\
                        random.sample(np.where(ytrue == 1)[0], labeled_N/2)

ys[random_labeled_points] = ytrue[random_labeled_points]
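
The label-masking step is independent of the song data and can be tried on any binary target vector. The following is a minimal sketch under stated assumptions: it is written for Python 3 with NumPy's Generator API, the function name mask_labels is hypothetical, and a toy alternating target stands in for the real genre labels.

```python
import numpy as np

def mask_labels(y_true, labeled_n, seed=0):
    """Keep labeled_n labels (half per class) and mark the rest as -1."""
    rng = np.random.default_rng(seed)
    y_semi = np.full(len(y_true), -1)
    for label in (0, 1):
        candidates = np.where(y_true == label)[0]
        # Sample without replacement so no index is chosen twice.
        chosen = rng.choice(candidates, size=labeled_n // 2, replace=False)
        y_semi[chosen] = label
    return y_semi

# Hypothetical binary targets: 50 cases of each class.
y_true = np.array([0, 1] * 50)
y_semi = mask_labels(y_true, labeled_n=20)
```

Sampling the same number of points from each class keeps the labeled set balanced even when the classes themselves are not, which avoids starting the base model from a one-sided view of the data.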


For comparison, we'll run a supervised SVM alongside our CPLE implementation. We'll also run the 
naive self-supervised implementation, which we saw earlier in this chapter, for comparison: 


basemodel = SGDClassifier(loss='log', penalty='l1') # scikit logistic regression
basemodel.fit(X[random_labeled_points, :], ys[random_labeled_points])
print "supervised log.reg. score", basemodel.score(X, ytrue)


ssmodel = SelfLearningModel (basemodel) 
ssmodel.fit(X, ys) 
print "self-learning log.reg. score", ssmodel.score(X, ytrue) 


ssmodel = CPLELearningModel (basemodel) 
ssmodel.fit(X, ys) 
print "CPLE semi-supervised log.reg. score", ssmodel.score(X, ytrue) 


The results that we obtain on this iteration are very strong: 


# supervised log.reg. score 0.698 
# self-learning log.reg. score 0.825 
# CPLE semi-supervised log.reg. score 0.833 


[Figure: ROC curve for CPLE semi-supervised log.reg. classification of the Heart dataset. ROC curve 
of class 0 (area = 0.84); ROC curve of class 1 (area = 0.84). Horizontal axis: False Positive Rate.] 

The CPLE semi-supervised model scores 0.833 (with an area of 0.84 under the ROC curve for each 
class), a score comparable to human estimation. It edges out the naive self-learning implementation 
(0.825) and, notably, beats the supervised baseline (0.698) by more than 13 percentage points. 
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Further reading 


A solid place to start understanding Semi-supervised learning methods is Xiaojin Zhu's very thorough 
literature survey, available at http://pages.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf. 


I also recommend a tutorial by the same author, available in slide format at 
http://pages.cs.wisc.edu/~jerryzhu/pub/sslicml07.pdf. 


The key paper on Contrastive Pessimistic Likelihood Estimation is Loog's 2015 paper, available at 
http://arxiv.org/abs/1503.00269. 


This chapter made a reference to the distinction between generative and discriminative models. A 
couple of relatively clear explanations of the distinction between generative and discriminative 
algorithms are provided by Andrew Ng (http://cs229.stanford.edu/notes/cs229-notes2.pdf) and 
Michael Jordan (http://www.ics.uci.edu/~smyth/courses/cs274/readings/jordan_logistic.pdf). 





For readers interested in Bayesian statistics, Allen Downey's book, Think Bayes, is a marvelous 
introduction (and one of my all-time favorite statistics books): 
https://www.google.co.uk/#q=think+bayes. 


For readers interested in learning more about gradient descent, I recommend Sebastian Ruder's blog 
at http://sebastianruder.com/optimizing-gradient-descent/. 


For readers interested in going a little deeper into the internals of conjugate descent, Jonathan 
Shewchuk's introduction provides clear and enjoyable definitions for a number of key concepts at 
https://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf. 
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Summary 


In this chapter, we tapped into a very powerful but lesser known paradigm in machine learning: 
semi-supervised learning. We began by exploring the underlying concepts of transductive learning and 
self-training, and improved our understanding of the latter class of techniques by working with a 
naive self-training implementation. 


We quickly began to see weaknesses in self-training and looked for an effective solution, which we 
found in the form of CPLE. CPLE is a very elegant and highly applicable framework for 
semi-supervised learning that makes no assumptions beyond those of the classifier that it uses as a 
base model. In return, we found CPLE to consistently offer performance in excess of naive 
semi-supervised and supervised implementations, at minimal risk. We've gained a significant amount 
of understanding regarding one of the most useful recent developments in machine learning. 


In the next chapter, we'll begin discussing data preparation skills that significantly increase the 
effectiveness of all of the models that we've previously discussed. 




Chapter 6. Text Feature Engineering 
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Introduction 


In preceding chapters, we've spent time assessing powerful techniques that enable the analysis of 
complex or challenging data. However, for the most difficult problems, the right technique will only 
get you so far. 


The persistent challenge that deep learning and supervised learning try to solve for is that finding 
solutions often requires multiple big investments from the team in question. Under the old paradigm, 
one often has to perform specific preparation tasks, requiring time, specialist skills, and knowledge. 
Often, even the techniques used were domain-specific and/or data type-specific. This process, via 
which features are derived, is referred to as feature engineering. 


Most of the deep learning algorithms which we've studied so far are intended to help find ways 
around needing to perform extensive feature engineering. However, at the same time, feature 
engineering continues to be seen as a hugely important skill for top-level ML practitioners. The 
following quotes come from leading Kaggle competitors, via David Kofoed Wind's contribution to 
the Kaggle blog: 


"The features you use influence more than everything else the result. No algorithm alone, to my knowledge, can supplement 
the information gain given by correct feature engineering." 


--(Luca Massaron) 


"Feature engineering is certainly one of the most important aspects in Kaggle competitions and it is the part where one 
should spend the most time on. There are often some hidden features in the data which can improve your performance by a 
lot and if you want to get a good place on the leaderboard you have to find them. If you screw up here you mostly can't win 
anymore, there is always one guy who finds all the secrets. However, there are also other important parts, like how you 
formulate the problem. Will you use a regression model or classification model or even combine both or is some kind of 
ranking needed. This, and feature engineering, are crucial to achieve a good result in those competitions. There are also 
some competitions where (manual) feature engineering is not needed anymore; like in image processing competitions. 
Current state of the art deep learning algorithms can do that for you." 


--(Josef Feigl) 


There are a few key themes here; feature engineering is powerful and even a very small amount of 
feature engineering can have a big impact on one's classifiers. It is also frequently necessary to 
employ feature engineering techniques if one wishes to deliver the best possible results. Maximising 
the effectiveness of your machine learning algorithms requires a certain amount of both domain- 
specific and data type-specific knowledge (secrets). 


One more quote: 


"For most Kaggle competitions the most important part is feature engineering, which is pretty easy to learn how to do." 


--(Tim Salimans) 


Tim's not wrong; most of what you'll learn in this chapter is intuitive, effective tricks and 
transformations. This chapter will introduce you to a few of the most effective and commonly used 
preparation techniques applied to text and time series data, drawing from NLP and financial time 
series applications. We'll walk through how the techniques work, what one should expect to see, and 
how one can diagnose whether they're working as desired. 
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Text feature engineering 


In preceding sections, we've discussed some of the methods by which we might take a dataset and 
extract a subset of valuable features. These methods have broad applicability but are less helpful 
when dealing with non-numerical/non-categorical data, or data that cannot be easily translated into 
numerical or categorical data. In particular, we need to apply different techniques when working with 
text data. 


The techniques that we'll study in this section fall into two main categories—cleaning techniques and 
feature preparation techniques. These are typically implemented in roughly that order and we'll study 
them accordingly. 
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Cleaning text data 


When we work with natural text data, a different set of approaches apply. This is because in 
real-world contexts, the idea of a naturally clean text dataset is pretty unsafe; text data is rife with 
misspellings, non-dictionary constructs like emoticons, and in some cases, HTML tagging. As such, 
we need to be very thorough with our cleaning. 


In this section, we'll work with a number of effective text-cleaning techniques, using a pretty gnarly 
real-world dataset. Specifically, we'll be using the Impermium dataset from a 2012 Kaggle 
competition, where the competition's goal was to create a model which accurately detects insults in 
social commentary. 


Yes, I do mean Internet troll detection. 


Let's get started! 


Text cleaning with BeautifulSoup 


Our first step should be manually checking the input data. This is pretty critical; with text data, one 
needs to try and understand what issues exist in the data initially so as to identify the cleaning needed. 


It's kind of painful to read through a dataset full of hateful Internet commentary, so here's an example 
entry: 





20120531031917Z    """\xa0@Flip\xa0how are you not ded""" 


We have an ID field and date field which don't seem to need much work. The text fields, however, are 
quite challenging. From this one case, we can already see misspelling and HTML inclusion. 
Furthermore, many entries in the dataset contain attempts to bypass swear filtering, usually by 
including a space or punctuation element mid-word. Other data quality issues include multiple 
vowels (to extend a word), non-ascii characters, hyperlinks... the list goes on. 


One option for cleaning this dataset is to use regular expressions, which run over the input data to 
scrub out data quality issues. However, the quantity and variety of problem formats make it 
impractical to use a regex-based approach, at least to begin with. We're likely both to miss a lot of 
cases and also to misjudge the amount of preparation needed, leading us to clean too aggressively, or 
not aggressively enough; in specific terms we risk cutting into real text content or leaving parts of tags 
in place. What we need is a solution that will wash out the majority of common data quality problems 
to begin with so that we can focus on the remaining issues with a script-based approach. 


Enter BeautifulSoup. BeautifulSoup is a very powerful text cleaning library which can, among 
other things, remove HTML markup. Let's take a look at this library in action on our troll data: 


from bs4 import BeautifulSoup
import csv

trolls = []

with open('trolls.csv', 'r') as f:
    reader = csv.DictReader(f)
    for line in reader:
        trolls.append(BeautifulSoup(str(line["Comment"]), "html.parser"))

print(trolls[0])

eg = BeautifulSoup(str(trolls), "html.parser")

print(eg.get_text())


@Flip how are you not ded 
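
To see what markup stripping involves without installing anything, the following sketch uses Python's standard-library html.parser as a stand-in. It is not BeautifulSoup and is far less forgiving of broken markup; the class and function names are hypothetical, and the sample string is an invented variant of the entry shown earlier.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text found between tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_markup(raw):
    parser = TextExtractor()
    parser.feed(raw)
    parser.close()
    # Non-breaking spaces (from \xa0 or &nbsp;) become ordinary spaces.
    return "".join(parser.chunks).replace("\xa0", " ").strip()

cleaned = strip_markup("<p>\xa0@Flip&nbsp;how are <b>you</b> not ded</p>")
```

The misspelling survives, as it should at this stage: removing markup only exposes the text, and the later steps deal with its content.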


As we can see, we've already made headway on improving the quality of our text data. Yet, it's also 
clear from these examples that there's a lot of work left to do! As discussed, let's move on to using 
regular expressions to help further clean and tokenize our data. 


Managing punctuation and tokenizing 


Tokenisation is the process of creating a set of tokens from a stream of text. Many tokens are words, 
while others may be character sets (such as smileys or other punctuation strings, for example, 
PPPPDDDD). 


Now that we've removed a lot of the HTML ugliness from our initial dataset, we can take steps to 
further improve the cleanliness of our text data. To do this, we'll leverage the re module, which 
allows us to use operations over regular expressions, such as substring replacement. We'll perform a 
series of operations over our input text on this pass, which mostly focus on replacing variable or 
problematic text elements with tokens. Let's begin with a simple example, replacing e-mail addresses 
with an _EM token: 


text = re.sub(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}', '_EM', text)


Similarly, we can remove URLs, replacing them with the _U token: 


text = re.sub(r'\w+:\/\/\S+', r'_U', text)
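
A quick check of these two substitutions on a made-up sentence shows what the tokens look like in context. This is a sketch: the patterns are reconstructions of the listing rather than verified originals, and the function name, addresses, and URL are hypothetical.

```python
import re

# Assumed patterns: an e-mail address and any scheme://... style URL.
EMAIL = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')
URL = re.compile(r'\w+:\/\/\S+')

def replace_contacts(text):
    """Swap e-mail addresses for _EM and URLs for _U."""
    text = EMAIL.sub('_EM', text)
    return URL.sub('_U', text)

sample = "mail bob.smith@example.com or visit http://example.com/page now"
cleaned = replace_contacts(sample)
```

Replacing rather than deleting keeps the fact that a comment contained a link or an address, which may itself be a useful feature, while discarding the specific value that would otherwise appear only once in the corpus.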


We can automatically remove extra or problematic whitespace and newline characters, hyphens, and 
underscores. In addition, we'll begin managing the problem of multiple characters, often used for 
emphasis in informal conversation. Extended series of punctuation characters are encoded here using 
codes such as _BQ and _BX; these longer tags are used as a means of differentiating from the more 
straightforward _Q and _X tags (which refer to the use of a question mark and exclamation mark, 
respectively). 


We can also use regular expressions to manage extra letters; by cutting down such strings to two 
characters at most, we're able to reduce the number of combinations to a manageable amount and 
tokenize that reduced group using the _EL token: 


# Format whitespaces
text = text.replace('"', ' ')
text = text.replace('\'', ' ')
text = text.replace('_', ' ')
text = text.replace('-', ' ')
text = text.replace('\n', ' ')
text = text.replace('\\n', ' ')
text = text.replace('\'', ' ')
text = re.sub(' +', ' ', text)
text = text.replace('\'', ' ')

# manage punctuation
text = re.sub(r'([^!\?])(\?{2,})(\Z|[^!\?])', r'\1 _BQ\n\3', text)
text = re.sub(r'([^\.])(\.{2,})', r'\1 _SS\n', text)
text = re.sub(r'([^!\?])(\?|!){2,}(\Z|[^!\?])', r'\1 _BX\n\3', text)
text = re.sub(r'([^!\?])\?(\Z|[^!\?])', r'\1 _Q\n\2', text)
text = re.sub(r'([^!\?])!(\Z|[^!\?])', r'\1 _X\n\2', text)

# cut runs of three or more repeated letters down to two
text = re.sub(r'([a-zA-Z])\1\1+(\w*)', r'\1\1\2 _EL', text)
text = re.sub(r'([a-zA-Z])\1\1+(\w*)', r'\1\1\2 _EL', text)

# join words split by a full stop, then replace remaining non-letter characters with spaces
text = re.sub(r'(\w+)\.(\w+)', r'\1\2', text)
text = re.sub(r'[^a-zA-Z]', ' ', text)
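
Two of these rules are easy to verify by hand. The sketch below applies only the repeated-question-mark rule and the repeated-letter rule to short invented strings; the function name tag_emphasis is hypothetical and the patterns are assumed to match the listing.

```python
import re

def tag_emphasis(text):
    """Tag runs of question marks and squeeze elongated words."""
    # Two or more question marks in a row become a _BQ tag.
    text = re.sub(r'([^!\?])(\?{2,})(\Z|[^!\?])', r'\1 _BQ\n\3', text)
    # Three or more repeats of a letter are cut to two and tagged _EL.
    text = re.sub(r'([a-zA-Z])\1\1+(\w*)', r'\1\1\2 _EL', text)
    return text

questioned = tag_emphasis("what???")
stretched = tag_emphasis("soooo good")
```

Here "what???" becomes "what _BQ" followed by a newline, and "soooo good" becomes "soo _EL good". Cutting to two letters rather than one means that legitimate double letters, as in "good", are left alone.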


Next, we want to begin creating other tokens of interest. One of the more helpful indicators available 
is the _SW token for swearing. We'll also use regular expressions to help identify and tokenize smileys 
into one of four buckets; big and happy smileys (_BS), small and happy ones (_S), big and sad ones 
(_BF), and small and sad ones (_F): 


text = re.sub(r'([#%&\*\$]{2,})(\w*)', r'\1\2 _SW', text)
text = re.sub(r' [8x;:=]-?(?:\)|\}|\]|>){2,}', r' _BS', text)
text = re.sub(r' (?:[;:=]-?[\)\}\]d>])|(?:<3)', r' _S', text)
text = re.sub(r' [x:=]-?(?:\(|\[|\||\\|/|\{|<){2,}', r' _BF', text)
text = re.sub(r' [x:=]-?[\(\[\|\\/\{<]', r' _F', text)


Note 


Smileys are complicated by the fact that their uses change frequently; while this series of characters is 
reasonably current, it's by no means complete; for example, see emojis for a range of non-ascii 
representations. For several reasons, we'll be removing non-ascii text from this example (a similar 
approach is to use a dictionary to force compliance), but both approaches have the obvious drawback 
that they remove cases from the dataset, meaning that any solution will be imperfect. In some cases, 
this approach may lead to the removal of a substantial amount of data. In general, then, it's prudent to 
be aware of the general challenge around character-based images in text content. 


Following this, we want to begin splitting text into phrases. This is a simple application of 
re.split, which enables the input to be treated as a vector of words (words) rather than as long 
strings (text): 


phrases = re.split(r'[;:\.()\n]', text)
phrases = [re.findall(r'[\w%\*&#]+', ph) for ph in phrases]
phrases = [ph for ph in phrases if ph]

words = []

for ph in phrases:
    words.extend(ph)
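
The splitting step can be run on its own to see the difference between phrases and words. In this sketch the function name split_into_words is hypothetical, and the character class in the findall pattern is an assumed reading of the listing.

```python
import re

def split_into_words(text):
    """Split text into phrases at punctuation, then into word tokens."""
    phrases = re.split(r'[;:\.()\n]', text)
    phrases = [re.findall(r'[\w%\*&#]+', ph) for ph in phrases]
    # Drop phrases that contained no tokens at all.
    phrases = [ph for ph in phrases if ph]
    words = []
    for ph in phrases:
        words.extend(ph)
    return phrases, words

phrases, words = split_into_words("@Flip how are you not ded")
```

The at sign is not in the token character class, so "@Flip" yields the token "Flip". Characters such as * and # are kept inside tokens on purpose, since they often appear in masked swearing.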


This gives us the following: 





20120531031917Z    [['Flip', 'how', 'are', 'you', 'not', 'ded']] 


Next, we perform a search for single-letter sequences. Sometimes, for emphasis, Internet 
communication involves the use of spaced single-letter chains. This may be attempted as a method of 
avoiding curse word detection: 


tmp = words
words = []

new_word = ''
for word in tmp:
    if len(word) == 1:
        new_word = new_word + word
    else:
        if new_word:
            words.append(new_word)
            new_word = ''
        words.append(word)
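
Wrapped in a function, the same loop is easy to test. The sketch below adds one thing the listing does not have, labeled as an assumption in the code: it also keeps a single-letter chain that ends the input, which the loop above would drop. The function name and the example tokens are hypothetical.

```python
def merge_single_letters(tokens):
    """Join consecutive single-letter tokens into one word."""
    merged, run = [], ''
    for token in tokens:
        if len(token) == 1:
            run += token
        else:
            if run:
                merged.append(run)
                run = ''
            merged.append(token)
    # Assumption, not in the listing: flush a chain that ends the input.
    if run:
        merged.append(run)
    return merged

example = merge_single_letters(['you', 's', 't', 'u', 'p', 'i', 'd', 'troll'])
```

Note that genuine one-letter words such as "a" or "I" are swept into neighbouring chains too, so this step trades a little accuracy on ordinary text for better recall on deliberately spaced-out words.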


So far, then, we've gone a long way toward cleaning and improving the quality of our input data. 
There are still outstanding issues, however. Let's reconsider the example we began with, which now 
looks like the following: 





20120531031917Z    ['_F', 'how', 'are', 'you', 'not', 'ded'] 


Much of our early cleaning has passed this example by, but we can see the effect of vectorising the 
sentence content as well as the now-cleaned HTML tags. We can also see that the emote used has 
been captured via the _F tag. When we look at a more complex test case, we see even more 
substantial change results: 


Raw: 

GALLUP DAILY\nMay 24-26, 2012 \u2013 Updates daily at 1 p.m. ET; reflects one-day change\nNo 
updates Monday, May 28; next update will be Tuesday, May 29.\nObama Approval48%-\nObama 
Disapproval45%-1\nPRESIDENTIAL ELECTION\nObama47%-\nRomney45%-\n7-day rolling 
average\n\n It seems the bump Romney got is over and the president is on his game. 


Cleaned and split: 

['GALLUP', 'DAILY', 'May', 'u', 'Updates', 'daily', 'pm', 'ET', 'reflects', 'one', 'day', 'change', 
'No', 'updates', 'Monday', 'May', 'next', 'update', 'Tuesday', 'May', 'Obama', 'Approval', 'Obama', 
'Disapproval', 'PRESIDENTIAL', 'ELECTION', 'Obama', 'Romney', 'day', 'rolling', 'average', 'It', 
'seems', 'bump', 'Romney', 'got', 'president', 'game'] 





However, there are two significant problems still obvious in both examples. In the first case, we have 
a misspelled word; we need to find a way to eliminate this. Secondly, a lot of the words in both
examples (for example, are and pm) aren't terribly informative in and of themselves. The problem we
find, particularly for shorter text samples, is that what's left after cleaning may contain only one or 
two meaningful terms. If these terms are not terribly common in the corpus as a whole, it can prove to 
be very difficult to train a classifier to recognise these terms' significance. 


Tagging and categorising words 


I expect that we all know that English language words come in several types—nouns, verbs, adverbs, 
and so on. These are commonly referred to as parts of speech. If we know that a certain word is an 
adjective, as opposed to a verb or stop word (such as a, the, or of), we can treat it differently or, more
importantly, our algorithm can!


If we can perform part of speech tagging by identifying and encoding word classes as categorical 
variables, we're able to improve the quality of our data by retaining only the valuable content. The 
full range of text tagging options and techniques is too broad to be effectively covered in one section
of this chapter, so we'll look at a subset of the applicable tagging techniques. Specifically, we'll focus
on n-gram tagging and backoff taggers, a pair of complementary techniques that allow us to create
powerful recursive tagging algorithms.


We'll be using a Python library called the Natural Language Toolkit (NLTK). NLTK offers a wide 
array of functionality and we'll be relying on it at several points in this chapter. For now, we'll be 
using NLTK to perform tagging and removal of certain word types. Specifically, we'll be filtering out 
stop words. 


To answer the obvious first question (why eliminate stop words?), it tends to be true that stop words
add little to nothing to most text analysis and can be responsible for a degree of noise and training 
variance. Fortunately, filtering stop words is pretty straightforward. We'll simply import NLTK, 
download and import the dictionaries, then perform a scan over all words in our pre-existing word 
vector, removing any stop words found: 


import nltk
nltk.download()
from nltk.corpus import stopwords

# Build the stop list once rather than on every comparison
stop_words = set(stopwords.words("english"))
words = [w for w in words if w not in stop_words]


I'm sure you'll agree that this was pretty straightforward! Let's move on to discuss more NLTK 
functionality, specifically, tagging. 
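One detail worth checking when filtering is letter case: NLTK's English stop word list is lowercase, so a capitalised token such as How would pass through a case-sensitive comparison. The sketch below illustrates the idea with a small hand-written stop list, which is an assumption made so that the example runs without downloading the NLTK corpus; in practice you would substitute stopwords.words("english"). The helper name remove_stop_words is likewise hypothetical.

```python
# A tiny illustrative stop list; it stands in for stopwords.words("english").
STOP_WORDS = {"a", "the", "of", "how", "are", "you", "not"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form appears in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = ["F", "How", "are", "you", "not", "ded"]
print(remove_stop_words(tokens))
```

On our running example this leaves only the emote token and the misspelled word, which illustrates how little may remain of a short comment after cleaning.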


Tagging with NLTK 


Tagging is the process of identifying parts of speech, as we described previously, and applying tags to 
each term. 


In its simplest form, tagging can be as straightforward as applying a dictionary over our input data, 
just as we did previously with stopwords: 


# Tag each token in our word list with its part of speech
tagged = nltk.pos_tag(words)


However, even brief consideration will make it obvious that our use of language is a lot more 
complicated than this allows. We may use a word (such as ferry) as one of several parts of speech 
and it may not be straightforward to decide how to treat each word in every utterance. A lot of the 
time, the correct tag can only be understood contextually given the other words and their positioning 
within the phrase. 


Thankfully, we have a number of useful techniques available to help us solve linguistic challenges.


Sequential tagging


A sequential tagging algorithm is one that works by running through the input dataset, left-to-right and 
token-by-token (hence sequential!), tagging each token in succession. The decision over which tag
to assign is made based on that token, the tokens that preceded it, and the predicted tags for those
preceding tokens. 


In this section, we'll be using an n-gram tagger. An n-gram tagger is a type of sequential tagger,
which is pretrained to identify appropriate tags. The n-gram tagger takes (n-1)-many preceding POS
tags and the current token into consideration in producing a tag. 


Note 


For clarity, an n-gram is the term used for a contiguous sequence of n-many elements from a given set 
of elements. This may be a contiguous sequence of letters, words, numerical codes (for example, for 
state changes), or other elements. N-grams are widely used as a means of capturing the conjunct 
meaning of sets of elements—be those phrases or encoded state transitions—using n-many elements. 
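As a concrete illustration of the definition in this note, the following sketch extracts n-grams from any sequence, whether a list of words or a string of characters. The function name ngrams and the sample inputs are assumptions chosen for illustration.

```python
def ngrams(sequence, n):
    """Return every contiguous run of n elements from the sequence."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


words = ["he", "caught", "a", "magnificent", "carp"]
print(ngrams(words, 2))      # word bigrams
print(ngrams("carp", 3))     # character trigrams
```

A sequence shorter than n simply yields no n-grams, which foreshadows the sparsity problem that longer n-grams run into.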


The simplest form of n-gram tagger is one where n = 1, referred to as a unigram tagger. A unigram
tagger operates quite simply, by maintaining a conditional frequency distribution for each token. This
conditional frequency distribution is built up from a training corpus of terms; we can implement
training using a helpful train method belonging to the NgramTagger class in NLTK. The tagger
assumes that the tag which occurs most frequently for a given token in a given sequence is likely to be
the correct tag for that token. If the term carp is in the training corpus as a noun four times and as a
verb twice, a unigram tagger will assign the noun tag to any token whose type is carp.
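The carp example can be reproduced with a few lines of plain Python. This is a minimal sketch of the idea rather than NLTK's implementation: the function name train_unigram_tagger, the tag labels, and the toy training data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_tokens):
    """Build a token -> most frequent tag lookup from (token, tag) pairs."""
    counts = defaultdict(Counter)
    for token, tag in tagged_tokens:
        counts[token][tag] += 1
    return {token: tags.most_common(1)[0][0] for token, tags in counts.items()}


# Toy training data: 'carp' appears four times as a noun and twice as a verb.
training = [("carp", "NOUN")] * 4 + [("carp", "VERB")] * 2 + [("on", "PREP")]
model = train_unigram_tagger(training)

print(model.get("carp"))    # most frequent tag for 'carp'
print(model.get("ferry"))   # unseen token, so no tag is available
```

The lookup always returns the majority tag for a known token and nothing at all for an unknown one, which is exactly the limitation discussed next.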


This might suffice for a first-pass tagging attempt but clearly, a solution that only ever serves up one
tag for each set of homonyms isn't always going to be ideal. The solution we can tap into is using
n-grams with a larger value of n. With n = 3 (a trigram tagger), for instance, we can see how the
tagger might more easily distinguish the input He tends to carp on a lot from He caught a
magnificent carp!


However, once again there is a trade-off here between accuracy of tagging and ability to tag. As we 
increase n, we're creating increasingly long n-grams which become increasingly rare. In a very short 
time, we end up in a situation where our n-grams are not occurring in the training data, causing our
tagger to be unable to find any appropriate tag for the current token! 


In practice, we find that what we need is a set of taggers. We want our most reliably accurate tagger
to have the first shot at trying to tag a given dataset and, for any case that fails, we're comfortable with 
having a more reliable but potentially less accurate tagger have a try. 


Happily, what we want already exists in the form of backoff tagging. Let's find out more!


Backoff tagging


Sometimes, a given tagger may not perform reliably. This is particularly common when the tagger has 
high accuracy demands and limited training data. At such times, we usually want to build an ensemble 
structure that lets us use several taggers simultaneously. 


To do this, we create a distinction between two types of taggers: subtaggers and backoff taggers. 
Subtaggers are taggers like the ones we saw previously, sequential and Brill taggers. Tagging 
structures may contain one or multiple of each kind of tagger. 


If a subtagger is unable to determine a tag for a given token, then a backoff tagger may be referred to 
instead. A backoff tagger is specifically used to combine the results of an ensemble of (one or more)
subtaggers, as shown in the following example diagram: 


[Figure: a backoff tagger polls its subtaggers in order (first, second, ... Nth). When a subtagger
succeeds, tagging moves on to the next term; when it fails, the backoff tagger falls back to the next
subtagger.]







In simple implementations, the backoff tagger will simply poll the subtaggers in order, accepting the
first non-null tag provided. If all subtaggers return null for a given token, the backoff tagger will
assign a None tag to that token. The order in which the subtaggers are polled can be specified.
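The polling behaviour described here can be sketched in plain Python. This is an illustration of the control flow only, not NLTK's implementation: the function name backoff_tag and the two dictionary-based subtaggers are hypothetical stand-ins.

```python
def backoff_tag(token, subtaggers):
    """Poll subtaggers in order and return the first non-None tag."""
    for tagger in subtaggers:
        tag = tagger(token)
        if tag is not None:
            return tag
    return None


# Hypothetical subtaggers: a small specialised lookup, then a broader one.
specialised = {"carp": "VERB"}.get
general = {"carp": "NOUN", "ferry": "NOUN"}.get

chain = [specialised, general]
for token in ["carp", "ferry", "zyzzyva"]:
    print(token, backoff_tag(token, chain))
```

Because the first non-None answer wins, the position of each subtagger in the list determines which one takes precedence when several could tag the same token.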


Backoffs are typically used with multiple subtaggers of different types; this enables a data scientist to 
harness the benefits of multiple types of tagger simultaneously. Backoffs may refer to other backoffs
as needed, potentially creating highly redundant or sophisticated tagging structures: 


[Figure: a nested backoff chain of unigram, bigram, and trigram taggers. Each tagger that fails backs
off to the next, after which tagging moves on to the next term.]





In general terms, backoff taggers provide redundancy and enable you to use multiple taggers in a
composite solution. To solve our immediate problem, let's implement a nested series of n-gram 
taggers. We'll start with a trigram tagger, which will use a bigram tagger as its backoff tagger. If 
neither of these taggers has a solution, we'll have a unigram tagger as an additional backoff. This can 
be done very simply, as follows: 


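from nltk.tag import NgramTagger

# 'news' is the Brown press-reportage category (older NLTK releases labelled it 'a')
brown_a = nltk.corpus.brown.tagged_sents(categories='news')
tagger = None
for n in range(1, 4):
    # Each tagger backs off to the (n-1)-gram tagger built on the previous pass
    tagger = NgramTagger(n, brown_a, backoff=tagger)
words = tagger.tag(words)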


Creating features from text data


Once we've engaged in well-thought-out text cleaning practices, we need to take additional steps to 
ensure that our text becomes useful features. In order to do this, we'll look at another set of staple 
techniques in NLP: 


• Stemming
• Lemmatising
• Bagging using random forests


Stemming 


Another challenge when working with linguistic datasets is that multiple word forms exist for many 
word stems. For example, the root dance is the stem of multiple other words—dancing, dancer, 
dances, and so on. By finding a way to reduce this plurality of forms into stems, we find ourselves 
able to improve our n-gram tagging and apply new techniques such as lemmatisation. 


The techniques that enable us to reduce words to their stems are called stemmers. Stemmers work by 
parsing words as consonant/vowel strings and applying a series of rules. The most popular stemmer
is the Porter stemmer, which works by performing the following steps:


1. Simplifying the range of suffixes by reducing them to a smaller set (for example, ies becomes i).

2. Removing suffixes in several passes, with each pass removing a set of suffix types (for example,
past participle or plural suffixes, or suffixes such as ousness or alism).

3. Once all suffixes are removed, cleaning up word endings by adding 'e's where needed (for
example, ceas becomes cease).

4. Removing double 'l's.
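To give a feel for how rule-based suffix stripping behaves, here is a deliberately rough sketch loosely following some of the steps above. It is not the Porter algorithm (it omits the consonant/vowel analysis and the 'e' restoration step, and uses only a handful of suffixes); the function name toy_stem, the suffix list, and the minimum stem length of three are all assumptions made for illustration.

```python
def toy_stem(word):
    """Very rough suffix stripper; NOT the real Porter algorithm."""
    w = word.lower()
    # Normalise a suffix family to a smaller set (ies -> i).
    if w.endswith("ies"):
        w = w[:-3] + "i"
    # Strip at most one suffix, trying the longest candidates first.
    for suffix in ("ousness", "alism", "ing", "ed", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            break
    # Collapse a trailing double 'l'.
    if w.endswith("ll"):
        w = w[:-1]
    return w


for word in ["dancing", "dances", "ponies", "controlled"]:
    print(word, "->", toy_stem(word))
```

Even this crude version shows the characteristic behaviour of stemmers: related forms collapse toward a shared stem, but the stem itself need not be a dictionary word.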


The Porter stemmer works very effectively. In order to understand exactly how well it works, let's see
it in action! 


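from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

# stem() takes one word at a time; after tagging, words holds (token, tag) pairs
stems = [stemmer.stem(token) for token, tag in words]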


The output of this stemmer, as demonstrated on our pre-existing example, is the root form of the 
word. This may be a real word, or it may not; dancing, for instance, becomes danc. This is okay, but
it's not really ideal. We can do better than this! 


To consistently reach a real word form, let's apply a slightly different technique, lemmatisation. 
Lemmatisation is a more complex process to determine word stems; unlike Porter stemming, it uses a
different normalisation process for different parts of speech. Unlike Porter stemming, it also seeks to
find actual roots for words. Where a stem does not have to be a real word, a lemma does. 


Lemmatization also takes on the challenge of reducing synonyms down to their roots. For example,
where a stemmer might turn the term books into the term book, it isn't equipped to handle the term
tome. A lemmatizer can process both books and tome, reducing both terms to book.


As a necessary prerequisite, we need the POS for each input token. Thankfully, we've already applied 
a POS tagger and can work straight from the results of that process! 


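from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# words holds (token, tag) pairs; map each tag's first letter to a WordNet POS code
wn_pos = {'V': 'v', 'J': 'a', 'R': 'r'}
words = [lemmatizer.lemmatize(w, pos=wn_pos.get((t or 'N')[0], 'n')) for w, t in words]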


The output is now what we'd expect to see: 


Raw: his own high-flying exits off moving beasts
Lemmatised: ['high', 'fly', 'exit', 'move', 'beast']

Raw: The laughs you two heard were triggered by memories of
Lemmatised: ['the', 'laugh', 'two', 'hear', 'trigger', 'memory', ...]





We've now successfully stemmed our input text data, massively improving the effectiveness of lookup 
algorithms (such as many dictionary-based approaches) in handling this data. We've removed stop 
words and tokenized a range of other noise elements with regex methods. We've also removed any 
HTML tagging. Our text data has reached a reasonably processed state. There's one more linchpin 
technique that we need to learn, which will let us generate features from our text data. Specifically, 
we can use bagging to help quantify the use of terms. 


Let's find out more! 


Bagging and random forests 


Bagging is part of a family of techniques that may collectively be referred to as subspace methods. 
There are several forms of subspace method, each with its own name. If we draw random subsets
from the sample cases without replacement, then we're performing pasting. If we're sampling from cases with replacement,
it's referred to as bagging. If instead of drawing from cases, we work with a subset of features, then 
we're performing attribute bagging. Finally, if we choose to draw from both sample cases and 
features, we're employing what's known as a random patches technique. 
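The four variants differ only in what is sampled and whether cases are drawn with replacement, which a short sketch can make explicit. The function name draw_subset, its parameters, and the dataset dimensions below are assumptions for illustration; features are always drawn without replacement here.

```python
import random


def draw_subset(n_cases, n_features, case_frac=1.0, feature_frac=1.0,
                replace=False, rng=None):
    """Return (case_indices, feature_indices) for one base learner."""
    rng = rng or random.Random(0)
    k_cases = max(1, int(n_cases * case_frac))
    k_feats = max(1, int(n_features * feature_frac))
    if replace:
        # Sampling with replacement: the same case may be drawn repeatedly.
        cases = [rng.randrange(n_cases) for _ in range(k_cases)]
    else:
        cases = rng.sample(range(n_cases), k_cases)
    features = sorted(rng.sample(range(n_features), k_feats))
    return cases, features


rng = random.Random(42)
pasting = draw_subset(10, 6, case_frac=0.5, rng=rng)
bagging = draw_subset(10, 6, replace=True, rng=rng)
attribute_bagging = draw_subset(10, 6, feature_frac=0.5, rng=rng)
random_patches = draw_subset(10, 6, case_frac=0.5, feature_frac=0.5, rng=rng)

print(pasting, bagging, attribute_bagging, random_patches, sep="\n")
```

Each base model in an ensemble would be trained on one such draw, so the choice of variant controls how much the models' training views differ from one another.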


The feature-based techniques, attribute bagging and random patches, are very valuable in
certain contexts, particularly very high-dimensional ones. Medical and genetics contexts both tend to 
see a lot of extremely high-dimensional data, so feature-based methods are highly effective within 
those contexts. 


In NLP contexts, it's common to work with bagging specifically. In the context of linguistic data, what 
we'll be dealing with is properly called a bag of words. A bag of words is an approach to text data
preparation that works by identifying all of the distinct words (or tokens) in a dataset and then
counting their occurrence in each sample. Let's begin with a demonstration, performed over a couple
of example cases from our dataset:


['F', 'How', 'are', 'you', 'not', 'ded']
['you', 'are', 'living', 'proof', 'that', 'bath', 'salts', 'effect', 'thinking']


This gives us the following 12-part list of terms: 





[
"how",
"are",
"you",
"not",
"ded",
"living",
"proof",
"that",
"bath",
"salts",
"effect",
"thinking"
]


Using the indices of this list, we can create a 12-part vector for each of the preceding sentences. This 
vector's values are filled by traversing the preceding list and counting the number of times each term 
occurs for each sentence in the dataset. Given our pre-existing example sentences and the list we 
created from them, we end up creating the following bags: 


how are you not ded                                     [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
you are living proof that bath salts effect thinking    [0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1]


This is the core of a bag of words implementation. Naturally, once we've translated the linguistic
content of text into numerical vectors, we're able to start using techniques that add sophistication to 
how we use this text in classification. 
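The counting procedure just described fits in a few lines of plain Python. This is a minimal sketch for the two example sentences, assuming lowercased tokens with the emote removed; the function names build_vocabulary and bag_of_words are hypothetical.

```python
def build_vocabulary(documents):
    """List each distinct token once, in order of first appearance."""
    vocabulary = []
    for tokens in documents:
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)
    return vocabulary


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary term occurs in one document."""
    return [tokens.count(term) for term in vocabulary]


docs = [
    ["how", "are", "you", "not", "ded"],
    ["you", "are", "living", "proof", "that", "bath", "salts", "effect", "thinking"],
]
vocab = build_vocabulary(docs)
for doc in docs:
    print(bag_of_words(doc, vocab))
```

Every document is mapped to a vector of the same length as the vocabulary, regardless of how long the document itself is, which is what makes the representation usable by standard classifiers.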





One option is to use weighted terms. We can use a term weighting scheme to modify the values within 
each vector so that terms that are indicative or helpful for classification are emphasized. Weighting 
schemes may be straightforward masks, such as a binary mask that indicates presence versus absence. 


Binary masking can be useful if certain terms are used much more frequently than normal; in such
cases, specific scaling (for example, log-scaling) may be needed if a binary mask is not used. At the
same time, though, frequency of term use can be informative (it may indicate emphasis, for instance) 
and the decision over whether to apply a binary mask is not always made simply. 
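The difference between the two options is easy to see on a single vector of counts. In the sketch below the counts are made up for illustration, and log(1 + count) is used as one common form of log-scaling (an assumption; other variants exist).

```python
import math

counts = [0, 1, 7, 0, 2]                      # hypothetical raw term counts

binary = [1 if c > 0 else 0 for c in counts]  # presence versus absence
log_scaled = [math.log(1 + c) for c in counts]

print(binary)
print([round(v, 3) for v in log_scaled])
```

The binary mask discards frequency entirely, whereas log-scaling keeps the ordering of frequencies while damping the influence of a heavily repeated term.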


Another weighting option is term frequency-inverse document frequency, or tf-idf. This scheme 
compares frequency of usage within a specific sentence and the dataset as a whole and uses values 
that increase if a term is used more frequently within a given sample than within the whole corpus. 
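A worked sketch may help make the weighting concrete. The version below uses the textbook form, term frequency multiplied by log(N / document frequency); this is an assumption for illustration, and scikit-learn's TfidfVectorizer uses a smoothed idf and normalises each vector by default, so its numbers will differ. The function name tf_idf and the sample documents are hypothetical.

```python
import math


def tf_idf(documents):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    weights = []
    for tokens in documents:
        row = {}
        for term in set(tokens):
            tf = tokens.count(term) / len(tokens)       # share of this document
            idf = math.log(n_docs / doc_freq[term])     # rarity across the corpus
            row[term] = tf * idf
        weights.append(row)
    return weights


docs = [
    ["how", "are", "you", "not", "ded"],
    ["you", "are", "living", "proof"],
]
scores = tf_idf(docs)
print(scores[0])
```

Terms that appear in every document (here, are and you) receive a weight of zero, while terms unique to one document are emphasised.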


Variations on tf-idf are frequently used in text mining contexts, including search engines. Scikit-learn 
provides a tf-idf implementation, TfidfVectorizer, which we'll shortly use to employ tf-idf for
ourselves. 


Now that we have an understanding of the theory behind bag of words and can see the range of 
technical options available to us once we develop vectors of word use, we should discuss how a bag 
of words implementation can be undertaken. Bag of words can be easily applied as a wrapper to a 
familiar model. While in general, subspace methods may use any of a range of base models (SVMs 
and linear regression models are common), it is very common to use random forests in a bag of words
implementation, wrapping up preparation and learning into a concise script. In this case, we'll employ 
bag of words independently for now, saving classification via a random forest implementation for the 
next section! 


Note 


While we'll discuss random forests in greater detail in Chapter 8, Ensemble Methods, (which 
describes the various types of ensemble that we can create), it is helpful for now to note that a
random forest is a set of decision trees. They are powerful ensemble models that are created either to 
run in parallel (yielding a vote or other net outcome) or boost one another (by iteratively adding a 
new tree to model the parts of the solution that the pre-existing set of trees couldn't model well). 


Due to the power and ease of use of random forests, they are commonly used as benchmarking 
algorithms. 


The process of implementing bag of words is, again, fairly straightforward. We initialize our bagging
tool (matter-of-factly referred to as a vectorizer). Note that for this example, we're putting a limit on
the size of the feature vector. This is largely to save ourselves some time; each document must be
compared against each item in the feature list, so when we get to running our classifier this could take
a little while! 


from sklearn.feature_extraction.text import TfidfVectorizer


vectorizer = TfidfVectorizer(analyzer = "word",
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = None,
                             max_features = 5000)


Our next step is to fit the vectorizer on our word data via fit_transform; as part of the fitting
process, our data is transformed into feature vectors:


train_data_features = vectorizer.fit_transform(words)
train_data_features = train_data_features.toarray()


This completes the pre-processing of our text data. We've taken this dataset through a full battery of 
text mining techniques, walking through the theory and reasoning behind each technique as well as 
employing some powerful Python scripts to process our test dataset. We're in a good position now to
take a crack at Kaggle's insult detection challenge! 


Testing our prepared data


So, now that we've done some initial preparation of the dataset, let's give the real problem a shot and 
see how we do. To help set the scene, let's consider Impermium's guidance and data description: 


This is a single-class classification problem. The label is either 0 meaning a neutral comment, or 1
meaning an insulting comment (neutral can be considered as not belonging to the insult class). Your
predictions must be a real number in the range [0,1] where 1 indicates 100% confident prediction
that comment is an insult.


• We are looking for comments that are intended to be insulting to a person who is a part of the
larger blog/forum conversation.
• We are NOT looking for insults directed to non-participants (such as celebrities, public
figures etc.).
• Insults could contain profanity, racial slurs, or other offensive language. But often times, they
do not.
• Comments which contain profanity or racial slurs, but are not necessarily insulting to
another person are considered not insulting.
• The insulting nature of the comment should be obvious, and not subtle.
• There may be a small amount of noise in the labels as they have not been meticulously
cleaned. However, contestants can be confident the error in the training and testing data is < 1%.


Contestants should also be warned that this problem tends to strongly overfit. The provided data is 
generally representative of the full test set, but not exhaustive by any measure. Impermium will be 
conducting final evaluations based on an unpublished set of data drawn from a wide sample. 


This is pretty nice guidance, in that it raises two particular points of caution. The desired score is the
area under the curve (AUC), which is a measure that is very sensitive both to false positives and to 
incorrect negative results (specificity and sensitivity). 


The guidance clearly states that continuous predictions are desired rather than binary 0/1 outputs.
This becomes critically important when using AUC; even a small amount of incorrect predictions 
given will radically decrease one's score if you only use categorical values. This suggests that rather 
than using the RandomForestClassifier algorithm, we'll want to use the RandomForestRegressor, 
a regression-focused alternative, and then rescale the results between zero and one. 
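To see why continuous scores matter for AUC, it helps to use the fact that AUC equals the probability that a randomly chosen positive case is scored above a randomly chosen negative one (with ties counting half). The sketch below computes AUC this way in plain Python; the function name auc_score, the labels, the scores, and the 0.5 threshold are made up for illustration.

```python
def auc_score(labels, scores):
    """Probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 1, 0, 0, 0]
continuous = [0.9, 0.6, 0.45, 0.55, 0.3, 0.1]
binary = [1 if s >= 0.5 else 0 for s in continuous]

print(auc_score(labels, continuous))
print(auc_score(labels, binary))
```

Thresholding the same predictions to 0/1 collapses many pairs into ties and lowers the score in this example, because the ranking information within each class of output is thrown away.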


Real Kaggle contests are run in a much more challenging and realistic environment: one where the
correct solution is not available. In Chapter 8, Ensemble Methods, we'll explore how top data
scientists react and thrive in such environments. For now, we'll take advantage of having the ability to
confirm whether we're doing well on the test dataset. Note that this advantage also presents a risk; if
the problem overfits strongly, we'll need to be disciplined to ensure that we're not overtraining on the
test data!


In addition, we also have the benefit of being able to see how well real contestants did. While we'll
save the real discussion for Chapter 8, Ensemble Methods, it's reasonable to expect each highly-ranking
contestant to have submitted quite a large number of failed attempts; having a benchmark will
help us tell whether we're heading in the right direction.


Specifically, the top 14 participants on the private (test) leaderboard managed to reach an AUC score 
of over 0.8. The top scorer managed a pretty impressive 0.84, while over half of the 50 teams who 
entered scored above 0.77. 


As we discussed earlier, let's begin with a random forest regression model. 


Note 
A random forest is an ensemble of decision trees. 


While a single decision tree is likely to suffer from variance- or bias-related issues, random forests 
are able to use a weighted average over multiple parallel trials to balance the results of modeling. 


Random forests are very straightforward to apply and are a good first-pass technique for a new data 
challenge; applying a random forest classifier to the data early on enables you to develop a good 
understanding as to what initial, baseline classification accuracy will look like as well as giving 
valuable insight into how the classification boundaries were formed; during the initial stages of 
working with a dataset, this kind of insight is invaluable. 


Scikit-learn provides RandomForestClassifier, along with its regression counterpart
RandomForestRegressor, to enable the easy application of a random forest algorithm.


For this first pass, we'll use 100 trees; increasing the number of trees can improve classification 
accuracy but will take additional time. Generally speaking, it's sensible to attempt fast iteration in the
early stages of model creation; the faster you can repeatedly run your model, the faster you can learn 
what your results are looking like and how to improve them! 


We begin by initializing and training our model: 


from sklearn.ensemble import RandomForestRegressor

trollspotter = RandomForestRegressor(n_estimators = 100, max_depth = 10,
                                     max_features = 1000)

y = trolls["y"]

trollspotter = trollspotter.fit(train_data_features, y)


We then grab the test data and apply our model to predict a score for each test case. We rescale these 
scores using a simple stretching technique: 


# header=0 replaces the file's own header row with the names given here
moretrolls = pd.read_csv('moretrolls.csv', header=0, names=['y', 'date',
                         'Comment', 'Usage'])
moretrolls["Words"] = moretrolls["Comment"].apply(cleaner)

y = moretrolls["y"]

# Reuse the vocabulary fitted on the training data; refitting would change the feature columns
test_data_features = vectorizer.transform(moretrolls["Words"])
test_data_features = test_data_features.toarray()

pred = trollspotter.predict(test_data_features)
# Stretch the predictions so that they span the range [0, 1]
pred = (pred - pred.min())/(pred.max() - pred.min())


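Finally, we apply the roc_curve and auc functions to calculate an AUC score for the model:


from sklearn.metrics import roc_curve, auc

# roc_curve also returns the thresholds, which we don't need here
fpr, tpr, _ = roc_curve(y, pred)
roc_auc = auc(fpr, tpr)
print("Random Forest benchmark AUC score, 100 estimators")
print(roc_auc)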


As we can see, the results are definitely not at the level that we want them to be at: 


Random Forest benchmark AUC score, 100 estimators 
0.537894912105 


Thankfully, we have a number of options that we can try to configure here: 


• Our approach to how we work with the input (preprocessing steps and normalisation)
• The number of estimators in our random forest
• The classifier we choose to employ
• The properties of our bag of words implementation (particularly the maximum number of terms)
• The structure of our n-gram tagger


On our next pass, let's adjust the size of our bag of words implementation, increasing the term cap 
from a slightly arbitrary 5,000 to anywhere up to 8,000 terms; rather than picking just one value, we'll 
run over a range and see what we can learn. We'll also increase the number of trees to a more 
reasonable number (in this case, we stepped up to 1000): 


Random Forest benchmark AUC score, 1000 estimators 
0.546439310772 


These results are slightly better than the previous set, but not dramatically so. They're definitely a fair 
distance from where we want to be! Let's go further and set up a different classifier. Let's try a fairly 
familiar option—the SVM. We'll set up our own SVM object to work with: 


import numpy as np
from sklearn import svm


class SVM(object):
    def __init__(self, texts, classes, nlpdict=None):
        # "balanced" replaces the "auto" class weighting of older scikit-learn releases
        self.svm = svm.LinearSVC(C=1000, class_weight="balanced")
        if nlpdict:
            self.dictionary = nlpdict
        else:
            self.dictionary = NLPDict(texts=texts)
        self._train(texts, classes)

    def _train(self, texts, classes):
        vectors = self.dictionary.feature_vectors(texts)
        self.svm.fit(vectors, classes)

    def classify(self, texts):
        vectors = self.dictionary.feature_vectors(texts)
        predictions = self.svm.decision_function(vectors)
        predictions = np.transpose(predictions)[0:len(predictions)]
        # Map signed distances from the hyperplane onto [0, 1], clipping the extremes
        predictions = predictions / 2 + 0.5
        predictions[predictions > 1] = 1
        predictions[predictions < 0] = 0
        return predictions


While the workings of an SVM are almost impenetrable to human assessment, as an algorithm it operates
effectively, iteratively translating the dataset into multiple additional dimensions in order to create 
complex hyperplanes at optimal class boundaries. It isn't a huge surprise, then, to see that the quality 
of our classification has increased: 


SVM AUC score 
0.625245653817 


Perhaps we're not getting enough visibility into what's happening with our results. Let's try shaking 
things up with a different take on performance measurement. Specifically, let's look at the difference 
between the model's label predictions and actual targets to see whether the model is failing more
frequently with certain types of input. 
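One simple way to carry out this kind of inspection is to rank cases by the size of their prediction error and read the worst offenders. The following is a minimal sketch under assumed inputs: the function name largest_errors and the comments, labels, and scores are hypothetical, for illustration only.

```python
def largest_errors(comments, targets, predictions, top=3):
    """Return the cases with the largest absolute prediction error."""
    rows = [
        (abs(t - p), c, t, p)
        for c, t, p in zip(comments, targets, predictions)
    ]
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:top]


# Hypothetical comments, labels, and model scores.
comments = ["you are a fool", "nice post", "go away, idiot", "great point"]
targets = [1, 0, 1, 0]
predictions = [0.2, 0.1, 0.9, 0.6]

worst = largest_errors(comments, targets, predictions, top=2)
for error, comment, target, score in worst:
    print(round(error, 2), target, score, comment)
```

Reading through the highest-error cases often reveals shared traits (for instance, insults without obvious keywords) that suggest which features are missing.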


So we've taken our prediction quite far. While we still have a number of options on the table, it's 
worth considering the use of a more sophisticated ensemble of models as being a solid option. In this 
case, leveraging multiple models instead of just one can enable us to obtain the relative advantages of 
each. To try out an ensemble against this example, run the score_trolls_blendedensemble.py
script.


Note 


This ensemble is a blended/stacked ensemble. We'll be spending more time discussing how this
ensemble works in Chapter 8, Ensemble Methods! 


Plotting our results, we can see that performance has improved, but by significantly less than we'd 
hoped: 
[Figure: ROC curve for the blended ensemble]


We're clearly having some issues with building a model against this data, but at this point, there isn't a
lot of value in throwing a more developed model at the problem. We need to go back to our features 
and aim to extend the feature set. 


At this point, it's worth taking some pointers from one of the most successful entrants of this particular 
Kaggle contest. In general, top-scoring entries tend to be developed by finding all of the tricks around 
the input data. The second-place contestant in the official Kaggle contest that this dataset was drawn
from was a user named tuzzeg. This contestant provided a usable code repository at
https://github.com/tuzzeg/detect_insults.


Tuzzeg's implementation differs from ours by virtue of much greater thoroughness. In addition to the 
basic features that we built using POS tagging, he employed POS-based bigrams and trigrams as well 
as subsequences (created from sliding windows of N-many terms). He worked with n-grams up to
7-grams and created character n-grams of lengths 2, 3, and 4.


Furthermore, tuzzeg took the time to create two types of composite model, both of which were 
incorporated into his solution—sentence level and ranking models. Ranking took our rationalization 
around the nature of the problem a step further by turning the cases in our data into ranked continuous 
values. 


Meanwhile, the innovative sentence-level model that he developed was trained specifically on 
single-sentence cases in the training data. For prediction on test data, he split the cases into sentences, 
evaluated each separately, and took only the highest score for sentences within the case. This was to 
accommodate the expectation that in natural language, speakers will frequently confine insulting 
comments to a single part of their speech. 
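The prediction step of such a sentence-level model can be sketched as follows. This is an illustration of the split, score, and take-the-maximum idea only, not tuzzeg's code: the function score_comment, the naive punctuation-based sentence splitter, and the keyword-based toy_scorer standing in for a trained model are all assumptions.

```python
import re


def score_comment(comment, score_sentence):
    """Score each sentence separately and keep the highest score."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", comment) if s.strip()]
    if not sentences:
        return 0.0
    return max(score_sentence(s) for s in sentences)


def toy_scorer(sentence):
    """Hypothetical stand-in for a trained sentence-level model."""
    hostile = {"idiot", "moron"}
    words = sentence.lower().split()
    return sum(w in hostile for w in words) / len(words)


comment = "Thanks for the link. I read it twice. You idiot."
print(score_comment(comment, toy_scorer))
```

Taking the maximum means that one hostile sentence is not diluted by the neutral sentences around it, which is the behaviour the hypothesis about confined insults calls for.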


Tuzzeg's model created over 100 feature groups (a stem-based bigram is one example of a feature
group; it is a group in the sense that the bigram process creates a vector of features), with the most
important ones (ranked by impact) being the following:


Feature group                        Impact
stem subsequence based               0.66
stem based (unigrams, bigrams)       0.18
char ngrams based (sentence)         0.07
char ngrams based                    0.04
all syntax                           0.006
all language models                  0.004
all mixed                            0.002

This is interesting, in that it suggests that a set of feature translations that we aren't currently using is
important in generating a usable solution. Particularly, the subsequence-based features are only a 
short step from our initial feature set, making it straightforward to add the extra feature: 


def subseq2(n, xs):
    # Pair each term with each of the next n terms in the sequence
    l = len(xs)
    return ['%s %s' % (xs[i], xs[j]) for i in range(l-1)
            for j in range(i+1, i+n+1) if j < l]


def getSubseq2(seqF, n):
    # Build a row -> feature-set function from a sequence extractor seqF
    def f(row):
        seq = seqF(row)
        return set(seq + subseq2(n, seq))
    return f


Subseq2test = getSubseq2(line, 2)

This approach yields excellent results. While I'd encourage you to export Tuzzeg's own solution and
apply it, you can also look at the score_trolls_withsubseq.py script provided in this project's
repository to get a feeling for how powerful additional features can be once incorporated.


With these additional features added, we see a dramatic improvement in our AUC score: 


[Figure: ROC curve]


Running this code provides a very healthy 0.834 AUC score. This simply goes to show the power of
thoughtful and innovative feature engineering; while the specific features generated in this chapter
will serve you well in other contexts, specific hypotheses (such as hostile comments being isolated to
specific sentences within a multi-sentence comment) can lead to very effective features.


As we've had the luxury of checking our reasoning against test data throughout this chapter, we can't 
reasonably say that we've worked under life-like conditions. We didn't take advantage of having 
access to the test data by reviewing it ourselves, but it's fair to say that knowing what the private 
leaderboard scored for this challenge made it easier for us to target the right fixes. In Chapter 8, 
Ensemble Methods, we'll be working on another tricky Kaggle problem in a more rigorous and 
realistic way. We'll also be discussing ensembles in depth!


Further reading


The quotes at the start of this chapter were sourced from the highly-readable Kaggle blog, No Free 
Hunch. Refer to http://blog.kaggle.com/2014/08/01/learning-from-the-best/. 


There are many good resources for understanding NLP tasks. One fairly thorough, eight-part piece is
available online at http://textminingonline.com/dive-into-nltk-part-1-getting-started-with-nltk.


If you're keen to get started, one great option is to try Kaggle's for Knowledge NLP task, which is
perfectly suited as a testbed for the techniques described in this chapter:
https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words.


The Kaggle contest cited in this chapter is available at
https://www.kaggle.com/c/detecting-insults-in-social-commentary.


For readers interested in further description of the ROC curve and the AUC measure, consider Tom
Fawcett's excellent introduction, available at
https://ccrma.stanford.edu/workshops/mir2009/references/ROCintro.pdf.


Summary


This chapter introduced a lot of useful and highly applicable skills. We took a set of messy,
complication-strewn text data and, through a series of rigorous steps, turned it
into a large set of effective features. We began by picking up a set of data cleaning skills that
stripped out a lot of the noise and problem elements, then followed up by turning text into features
using POS tagging and bag of words. In the process, you learned to apply a set of techniques that are
widely applicable and very empowering, enabling you to solve difficult problems in many natural
language processing contexts.


Through experimentation with multiple individual models and ensembles, we discovered that where a 
smarter algorithm might not yield a strong result, thorough and creative feature engineering can yield 
massive improvements in model performance.


Introduction


We have recognized the importance of feature engineering. In the previous chapter, we discussed 
techniques that enable us to select from a range of features and work effectively to transform our 
original data into features, which can be effectively processed by the advanced ML algorithms that 
we have discussed thus far. 


The adage garbage in, garbage out is relevant in this context. In earlier chapters, we have seen how
image recognition and NLP tasks require carefully-prepared data. In this chapter, we will be looking 
at a more ubiquitous type of data: quantitative or categorical data that is collected from real-world 
applications. 


Data of the type that we will be working with in this chapter is common to many contexts. We could
be discussing telemetry data captured from sensors in a forest, game consoles, or financial
transactions. We could be working with geological survey information or bioassay data collected 
through research. Regardless, the core principles and techniques remain the same. 


In this chapter, you will be learning how to interrogate this data to weed out or mitigate quality issues, 
how to transform it into forms that are conducive to machine learning, and how to creatively enhance 
that data. 


In general terms, the concepts that we'll be discussing in this chapter are as follows:


• The different approaches to feature set creation and the limits of feature engineering

• How to use a large set of techniques to enhance and improve an initial dataset

• How to tie in and use domain knowledge to understand valid options to transform and improve
the clarity of existing data

• How we can test the value of individual features and feature combinations so that we only keep
what we need


While we will begin with a detailed discussion of the underlying concepts, by the end of this chapter 
we will be working with multiple, iterative trials and using specialized tests to understand how 
helpful the features that we are creating will be to us.


Creating a feature set


The most important factor involved in successful machine learning is the quality of your input data. A
good model with misleading, inappropriately normalized, or uninformative data will see nowhere
near the same level of success as a model run over appropriately prepared data.


In some cases, you have the ability to specify data collection or have access to a useful, sizeable, and 
varied set of source data. With the right knowledge and skillset, you can use this data to create highly 
useful feature sets. 


In general, having a strong knowledge as to how to construct good feature sets is very helpful as it 
enables you to audit and assess any new dataset for missed opportunities. In this chapter, we will 
introduce a design process and technique set that make it easier to create effective feature sets. 


As such, we'll begin by discussing some techniques that we can use to extend or reinterpret existing 
features, potentially creating a large number of useful parameters to include in our models.


However, as we will see, there are limitations on the effective use of feature engineering techniques 
and we need to be mindful of the risks around engineered datasets.


Engineering features for ML applications


We have discussed what you can do about patching up data quality issues in your data and we have 
talked about how you can creatively use dimensions in what you have to join to external data. 


Once you have a reasonably well-understood and quality-checked set of data in front of you, there is
usually still a significant amount of work needed before you can produce effective models from that 
data. 


Using rescaling techniques to improve the learnability of features 


The main challenge with directly feeding unprepared data to many machine learning models is that the
algorithm is sensitive to the relative size of different variables. If your dataset has multiple
parameters whose ranges differ, some algorithms will treat the variables whose variance is greater as
indicative of more significant change than variables with smaller values and less variance.


The key to resolving this potential problem is rescaling, a process by which parameter values'
relative size is adjusted while retaining the initial ordering of values within each parameter (a
monotonic transformation).


Gradient descent algorithms (which include most deep learning algorithms; see
http://sebastianruder.com/optimizing-gradient-descent/) are significantly more efficient if the input
data is scaled prior to training. To understand why, we'll resort to drawing some pictures. A given
series of training steps may appear as follows:





When applied to unscaled data, these training steps may not converge effectively (as per the left-hand 
example in the following diagram). 


With each parameter having a differing scale, the parameter space in which models are attempting to
train can be highly distorted and complex. The more complex this space, the harder it becomes to
train a model within it. This is an involved subject that can be effectively described, in general terms,
through a metaphor, but for readers looking for a fuller explanation there is an excellent reference in
this chapter's Further reading section. For now, it is not unreasonable to think in terms of gradient
descent models during training as behaving like marbles rolling down a slope. These marbles are
prone to getting stuck in saddle points or other complex geometries on the slope (which, in this
context, is the surface created by our model's objective function, the learning function whose output
our models typically train to minimize). With scaled data, however, the surface becomes more
regularly-shaped and training can become much more effective:





The classic example is a linear rescaling between 0 and 1; with this method, the largest parameter
value is rescaled to 1, the smallest to 0, with intermediate values falling in the 0-1 interval,
proportionate to their original size relative to the largest and smallest values. Under such a
transformation, the vector [0, 10, 25, 20, 18], for instance, would become [0, 0.4, 1, 0.8, 0.72].


The particular value of this transformation is that, for multiple features that may vary in magnitude
in their raw form, the rescaled features will sit within the same range, enabling your machine learning
algorithm to train on meaningful information content.
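As a minimal sketch of this linear rescaling, the following NumPy snippet reproduces the worked vector from the text. The function name `min_max_rescale` is our own, and mapping a constant feature to all zeros is an assumption made to avoid dividing by zero.

```python
import numpy as np


def min_max_rescale(values):
    # Linearly map the smallest value to 0 and the largest to 1.
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # A constant feature carries no ordering information; map it to 0.
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)


rescaled = min_max_rescale([0, 10, 25, 20, 18])
print(rescaled)
```

In practice, the minimum and maximum should be taken from the training data only and then reused to rescale any new data.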


This is the most straightforward rescaling option, but there are some nonlinear scaling alternatives 
that can be much more helpful in the right circumstances; these include square scaling, square root 
scaling, and perhaps most commonly, log-scaling. 


Log-scaling of parameter values is very common in physics and in contexts where the underlying data
is frequently affected by a power law (for example, an exponential growth in y given a linear increase
in x).


Unlike linear rescaling, log-scaling adjusts the relative spacing between data cases. This can be a
double-edged sword. On the one hand, log-scaling handles outlying cases very well. Let's take an
example dataset describing individual net wealth for members of a fictional population, described by
the following summary statistics:


Statistic         Wealth
Min               1
First Quartile    42.5
Mean              3205433.343
Median            600
Third Quartile    1358
Max               10000000000


Prior to rescaling, this population is hugely skewed toward that single individual with absurd net 
worth. The distribution of cases per decile is as follows:


[Figure: count of cases per decile before rescaling]


After log-scaling, this distribution is far friendlier:


[Figure: count of cases per decile after log-scaling]


We could have chosen to take scaling further and drawn out the first half of this distribution even
more. In this case, log-10 normalization significantly reduces the impact of these outlying values,
enabling us to retain outliers in the dataset without losing detail at the lower end.
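To see the effect numerically, here is a small sketch that compares linear and log-10 rescaling. As an assumption for illustration, it treats the five quantile values from the summary table (min, quartiles, and max) as if they were individual cases. Note that a log transform requires strictly positive inputs.

```python
import numpy as np

# Illustrative cases: the min, quartile, and max values from the summary table.
wealth = np.array([1, 42.5, 600, 1358, 10000000000], dtype=float)

# Linear rescaling: everything except the outlier collapses toward 0.
linear = (wealth - wealth.min()) / (wealth.max() - wealth.min())

# Log-10 scaling followed by the same linear rescaling.
logged = np.log10(wealth)
log_scaled = (logged - logged.min()) / (logged.max() - logged.min())

print(linear)
print(log_scaled)
```

After linear rescaling, the third-quartile case is practically indistinguishable from the minimum, whereas after log-scaling it sits roughly a third of the way along the range. The ordering of cases is unchanged in both versions.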


With this said, it's important to note that in some contexts, that same enhancement of clustered cases
can enhance noise in variant parameter values and create the false impression of greater spacing
between values. This tends not to negatively affect how log-scaling handles outliers; the impact is
usually seen for groups of smaller-valued cases whose original values are very similar.


The challenges created by introducing nonlinearities through log-scaling are significant and, in
general, nonlinear scaling is only recommended for variables that you understand and that have a
nonlinear relationship or trend underlying them.


Creating effective derived variables 


Rescaling is a standard part of preprocessing in many machine learning applications (for instance, 
almost all neural networks). In addition to rescaling, there are other preparatory techniques, which 
can improve model performance by strategically reducing the number of parameters input to the 
model. The most common example is of a derived measure that takes multiple existing data points and 
represents them within a single measure. 


These are extremely prevalent; examples include acceleration (as a function of velocity values from
two points in time), body mass index (as a function of height and weight), and price-earnings
(P/E) ratio for stock scoring. Essentially, any derived score, ratio, or complex measure that you ever
encounter is a combination score formed from multiple components.


For datasets in familiar contexts, many of these pre-existing measures will be well-known. Even in
relatively well-known areas, however, looking for new supporting measures or transformations using
a mix of domain knowledge and existing data can be very effective. When thinking through derived
measure options, some useful concepts are as follows:


• Two variable combinations: Multiplication, division, or normalization of the n parameter as a
function of the m parameter.

• Measures of change over time: A classic example here is acceleration or 7D change in a
measure. In more complex contexts, the slope of an underlying time series function can be a
helpful parameter to work with instead of working directly with the current and past values.

• Subtraction of a baseline: Using a base expectation (a flat expectation such as the baseline
churn rate) to recast a parameter in terms of that baseline can be a more immediately
informative way of looking at the same variable. For the churn example, we could generate a
parameter that describes churn in terms of deviation from an expectation. Similarly, in stock
trading cases, we might look at closing price in terms of the opening price.

• Normalization: Following on from the previous case, normalization of parameter values based
on the values of another parameter or baseline that is dynamically calculated given properties of
other variables. One example here is failed transaction rate; in addition to looking at this value
as a raw (or rescaled) count, it often makes sense to normalize this in terms of attempted
transactions.
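The concepts in the preceding list can be sketched in a few lines of NumPy. All of the data below is hypothetical (eight days of activity for a single customer), and the flat baseline of 45 is an assumption chosen purely for illustration.

```python
import numpy as np

# Hypothetical daily data for one customer over eight days.
attempted = np.array([20, 25, 18, 30, 22, 27, 19, 24], dtype=float)
failed = np.array([1, 2, 0, 6, 1, 2, 1, 3], dtype=float)
engagement = np.array([50, 48, 47, 45, 44, 40, 39, 35], dtype=float)
baseline_engagement = 45.0  # assumed flat expectation

# Normalization: failed transactions as a share of attempted transactions.
failed_rate = failed / attempted

# Subtraction of a baseline: engagement as a deviation from the expectation.
engagement_vs_baseline = engagement - baseline_engagement

# Change over time: 7-day (7D) change in engagement.
change_7d = engagement[7:] - engagement[:-7]

# Slope of the time series, from a least-squares straight-line fit.
days = np.arange(len(engagement))
slope = np.polyfit(days, engagement, 1)[0]
```

Each derived array can then be added to the feature set alongside, or in place of, the raw columns that it was built from.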


Creative recombination of these different elements lets us build very effective scores. Sometimes, for 
instance, a parameter that tells us the slope of customer engagement (declining or increasing) over 
time needs to be conditioned on whether that customer was previously highly engaged or hardly 
engaged, as a slight decline in engagement might mean very different things in each context. It is the
data scientist's job to effectively and creatively build feature sets that capture these subtleties for a
given domain.


So far, this discussion has focused on numerical data. Often, however, useful data is locked up inside
non-numeric parameters such as codes or categorical data. Accordingly, we will next discuss a set of 
effective techniques to turn non-numeric features into usable parameters. 


Reinterpreting non-numeric features 


A common challenge, which can be problematic and problem-specific, is how non-numeric features
are treated. Frequently, valuable information is encoded within non-numerical shorthand values. In 
the case of stock trades, for instance, the identity of the stock itself (for example, AAPL) as well as 
that of the buyer and seller is interesting information that we expect to relate meaningfully to our 
problem. Taking this example further, we might also expect some stocks to trade differently from 
others even within the industry, and organizational differences within companies, which may occur at 
some or all points of time, also provide important context. 


One simple option that works in some cases is building an aggregation or series of aggregations. The
most obvious example is a count of occurrences with the possibility of creating extended measures
(changes in count between two time windows) as described in the preceding section.


Building summary statistics and reducing the number of rows in the dataset introduces the risk of
reducing the amount of information that your model has available to learn from (increasing the risk of
model fragility and overfitting). As such, it's generally a bad idea to extensively aggregate and reduce
input data. This is doubly true with deep learning techniques, such as the algorithms discussed and
used in Chapters 2-4.


Rather than extensively using aggregation-based approaches, let's look at an alternative way of
translating string-encoded values into numerical data. Another very popular class of techniques is
encoding, with the most common encoding tactic being one-hot encoding. One-hot encoding is the
process of turning a series of categorical responses (for example, age groups) into a set of binary
variables, with each response option (for example, 18-30) represented by its own binary variable.
This is a little more intuitive when presented visually:


After encoding, this dataset of categorical and continuous variables becomes a tensor of binary
variables:


Age 23    Age 25    Age 34    Age 41    Gender F    Gender M
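A minimal one-hot encoder can be sketched as follows. The five records are hypothetical, chosen so that the resulting columns match the column headings shown above, and the `one_hot` helper is our own rather than a library function.

```python
import numpy as np


def one_hot(values):
    # One binary column per distinct response, in sorted order.
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)), dtype=int)
    for row, value in enumerate(values):
        encoded[row, index[value]] = 1
    return categories, encoded


# Hypothetical records.
ages = [23, 25, 34, 41, 25]
genders = ['F', 'M', 'M', 'F', 'F']

age_cols, age_bits = one_hot(ages)
gender_cols, gender_bits = one_hot(genders)

encoded = np.hstack([age_bits, gender_bits])
columns = (['Age %s' % a for a in age_cols] +
           ['Gender %s' % g for g in gender_cols])
```

Every row contains exactly one 1 per original variable, so individual columns can later be kept or dropped independently of one another.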





The advantage that this presents is significant; it enables us to tap into the very valuable tag
information contained within a lot of datasets without aggregation or risk of reducing the information
content of the data. Furthermore, one-hot allows us to separate specific response codes for encoded
variables into separate features, meaning that we can identify more or less meaningful codes for a
specific variable and only retain the important values.


Another very effective technique, used primarily for text codes, is known as the hash trick. A hash, in
simple terms, is a function that translates data into a numeric representation. Hashes will be a familiar
concept to many, as they're frequently used to encode sensitive parameters and summarize otherwise
bulky data. In order to get the most out of the hash trick, however, it's important to understand how the
trick works and what can be done with it.


We can use hashing to turn a text phrase into a numeric value that we can use as an identifier for that 
phrase. While there are many applications of different hashing algorithms, in this context even a 
simple hash makes it straightforward to turn string keys and codes into numerical parameters that we 
can model effectively. 


A very simple hash might turn each alphabet character into a corresponding number. a would become 
1, b would be 2, and so on. Hashes could be generated for words and phrases by summing those 
values. The phrase cat gifs would translate under this scheme as follows: 


Cat: 3 + 1 + 20
Gifs: 7 + 9 + 6 + 19
Total: 65


This is a terrible hash for two reasons (quite disregarding the fact that the input contains junk words!). 
Firstly, there's no real limit on how many outputs it can present. When one remembers that the whole 
point of the hash trick is to provide dimensionality reduction, it stands to reason that the number of 
possible outputs from a hash must be bounded! Most hashes limit the range of numbers that they 
output, so part of the decision in terms of selecting a hash is related to the number of features you'd 
prefer your model to have. 


Note 


One common behavior is to choose a power of two as the hash range; this tends to speed things up by
allowing bitwise operations during the hashing process. 


The other reason that this hash kind of sucks is that changes to the word have a small impact rather 
than a large one. If cat became bat, we'd want our hash output to change substantially. Instead, it 
changes by one (becoming 64). In general, a good hash function is one where a small change in the
input text will cause a large change in the output. This is partly because language structures tend to be 
very uniform (thus scoring similarly), but slightly different sets of nouns and verbs within a given 
structure tend to confer very different meanings to one another (the cat sat on the mat versus the car 
sat on the cat). 
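Both weaknesses are easy to see in code. The sketch below implements the letter-sum hash described above, alongside a bounded alternative. Using MD5 from Python's standard library and a range of 1,024 buckets are our own assumptions, chosen only because MD5 is a convenient, well-mixed hash; neither is prescribed by the text.

```python
import hashlib


def letter_sum_hash(text):
    # a=1, b=2, ... z=26; anything that isn't a letter is ignored.
    return sum(ord(ch) - ord('a') + 1
               for ch in text.lower() if 'a' <= ch <= 'z')


def bounded_hash(text, n_buckets=1024):
    # A well-mixed hash reduced to a fixed range of n_buckets values.
    digest = hashlib.md5(text.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_buckets


toy_cat = letter_sum_hash('cat gifs')
toy_bat = letter_sum_hash('bat gifs')

print(toy_cat, toy_bat)
print(bounded_hash('cat gifs'), bounded_hash('bat gifs'))
```

The letter-sum outputs differ by exactly one and grow without limit as the input gets longer, while the bounded hash always falls in the range 0 to 1,023. Because 1,024 is a power of two, the modulo is equivalent to a bitwise AND with 1,023.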


So we've described hashing. The hash trick takes things a little further. Hypothetically, turning every 
word into a hashed numerical code is going to lead to a large number of hash collisions—cases 
where two words have the same hash value. Naturally, these are rather bad. 


Handily, there's a distribution underlying how frequently different terms are used that works in our
favor. Called the Zipf distribution, it entails that the probability of encountering the nth most common
term is approximated by P(n) = 0.1/n up to around 1,000 (Zipf's law). This entails that each term is
much less likely to be encountered than the preceding term. After n = 1000, terms tend to be
sufficiently obscure that it's unlikely to encounter two that have the same hash in one dataset.
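A quick calculation shows how much of everyday usage the most common terms account for under this approximation. The snippet below simply sums P(n) = 0.1/n over the top-ranked terms; the function name is our own.

```python
def zipf_probability(n):
    # Approximate probability of encountering the nth most common term.
    return 0.1 / n


top_10 = sum(zipf_probability(n) for n in range(1, 11))
top_1000 = sum(zipf_probability(n) for n in range(1, 1001))

print(round(top_10, 4), round(top_1000, 4))
```

Under this approximation, the ten most common terms cover roughly 29 percent of all term occurrences and the top 1,000 cover about 75 percent, which leaves the long tail of rare terms sharing the remaining quarter.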


At the same time, a good hashing function has a limited range and is significantly affected by small
changes in input. These properties make the hash collision chance largely independent of term usage 
frequency. 


These two concepts (Zipf's law and a good hash's independence of hash collision chance and term
usage frequency) mean that there is very little chance of a hash collision, and where one occurs it is
overwhelmingly likely to be between two infrequently-used words.


This gives the hash trick a peculiar property. Namely, it is possible to reduce the dimensionality of a 
set of text input data massively (from tens of thousands of naturally occurring words to a few hundred 
or fewer) without reducing the performance of a model trained on hashed data, compared to training 
on unhashed bag-of-words features. 
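The following is a minimal sketch of the hash trick applied to bag-of-words counts. It is a from-scratch illustration under our own assumptions (MD5 as the hash and a deliberately tiny 16-slot vector), not the implementation used by any particular library.

```python
import hashlib


def hash_features(tokens, n_features=16):
    # Each token increments the slot chosen by its hash, so the vector
    # length is fixed no matter how large the vocabulary grows.
    vector = [0] * n_features
    for token in tokens:
        digest = hashlib.md5(token.encode('utf-8')).hexdigest()
        vector[int(digest, 16) % n_features] += 1
    return vector


doc = 'the cat sat on the mat'.split()
hashed = hash_features(doc)
print(hashed)
```

MD5 is used here rather than Python's built-in hash() because the latter is randomized between interpreter sessions for strings, which would make features irreproducible. For real workloads, scikit-learn provides FeatureHasher and HashingVectorizer, which implement the same idea.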


Proper use of the hash trick enables a lot of possibilities, including augmentations to the techniques 
that we discussed (specifically, bag-of-words). References to different hashing implementations are 
included in the Further reading section at the end of this chapter.


Using feature selection techniques


Now that we have a good selection of options for feature creation, as well as an understanding of the 
creative feature engineering possibilities, we can begin building our existing features into more 
effective variants. Given this new-found feature engineering skillset, we run the risk of creating 
extensive and hard-to-manage datasets. 


Adding features without limit increases the risk of model fragility and overfitting for certain types of 
models. This is tied to the complexity of the trends that you're attempting to model. In the simplest
case, if you're attempting to identify a significant distinction between two large groups, your model is
likely to support a large number of features. However, as the model you need to fit to make this 
distinction becomes more complex and as the group sizes that you have to work with become smaller, 
adding more and more features can harm the model's ability to classify consistently and effectively. 


This challenge is compounded by the fact that it isn't always obvious which parameter or variation is
best-suited for the task. Suitability can vary by the underlying model; decision forests, for instance, 
don't perform any better with monotonic transformations (that is, transformations that retain the initial 
ordering of data cases; one example 1s log-scaling) than with the unscaled base data; however, for 
other algorithms, the choice to rescale and the rescaling method used are both very impactful choices. 


Traditionally, the quantity of features and limits on the parameter amount were tied to the desire to 
develop a mathematical function that relates key inputs to the desired outcome scores. In this context, 
additional parameters needed to be incorporated as moving or nuisance variables. 


Each new parameter introduces another dimension that makes the modeled relationship more complex 
and the resultant model more likely to be overfitting the data that exists. A trivial example is if you
introduce a parameter that is just a unique label for each case; at this point, your algorithm will just 
learn those labels, making it very likely that your model fails entirely when introduced to a new 
dataset. 


Less trivial examples are no less problematic; the proportion of cases to features becomes very 
important when your features are separating cases down to very small groups. In short, increasing the 
complexity of the modeled function causes your model to be more liable to overfit and adding 
features can exacerbate this effect. According to this principle, we should be beginning with very 
small datasets and adding parameters only after justifying that they improve the model. 


However, in recent times, an opposing methodology—now generally seen as being part of a common 
way of doing data science—has gained ground. This methodology suggests that it's a good idea to 
add very large feature sets to incorporate every potentially valuable feature and work down to a 
smaller feature set that does the job. 


This methodology is supported by techniques that enable decisions to be made over huge feature sets
(potentially hundreds or thousands of features) and that tend to operate in a brute force manner. These
techniques will exhaustively test feature combinations, running models in series or in parallel until
the most effective parameter subsets are identified.


These techniques work, which is why this methodology has become popular. It is definitely worth
knowing about these techniques, if not using them, so you'll be learning how to apply them later in this 
chapter. 
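To make the brute force idea concrete, here is a small sketch that exhaustively scores feature subsets with an ordinary least-squares model on held-out data. The synthetic dataset, the choice of least squares, and the helper names are all assumptions made for this illustration; it is not one of the techniques applied later in the chapter.

```python
from itertools import combinations

import numpy as np


def holdout_mse(X_train, y_train, X_test, y_test, cols):
    # Fit least squares (with an intercept) on the chosen columns only.
    cols = list(cols)
    A = np.column_stack([np.ones(len(X_train)), X_train[:, cols]])
    coef = np.linalg.lstsq(A, y_train, rcond=None)[0]
    B = np.column_stack([np.ones(len(X_test)), X_test[:, cols]])
    return float(np.mean((B @ coef - y_test) ** 2))


def best_subset(X_train, y_train, X_test, y_test, max_size):
    # Try every subset of up to max_size features and keep the best one.
    n_features = X_train.shape[1]
    candidates = (cols for k in range(1, max_size + 1)
                  for cols in combinations(range(n_features), k))
    return min(candidates, key=lambda cols: holdout_mse(
        X_train, y_train, X_test, y_test, cols))


# Synthetic data in which only features 1 and 3 drive the target.
rng = np.random.RandomState(0)
X = rng.randn(300, 5)
y = 4.0 * X[:, 1] - 3.0 * X[:, 3] + 0.1 * rng.randn(300)

chosen = best_subset(X[:200], y[:200], X[200:], y[200:], max_size=2)
print(chosen)
```

The number of candidate subsets grows exponentially with the number of features, which is why practical implementations rely on parallel execution or on greedy shortcuts.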


The main disadvantage around using brute force techniques for feature selection is that it becomes
very easy to trust the outcomes of the algorithm, irrespective of what the features it selects actually 
mean. It is sensible to balance the use of highly effective, black-box algorithms against domain 
knowledge and an understanding of what's being undertaken. Therefore, this chapter will enable you 
to use techniques from both paradigms (build up and build down) so that you can adapt to different 
contexts. We'll begin by learning how to narrow down the feature set that you have to work with, from 
many features to the most valuable subset. 


Performing feature selection 


Having built a large dataset, often the next challenge one faces is how to narrow down the options to
retain only the most effective data. In this section, we'll discuss a variety of techniques that support 
feature selection, working by themselves or as wrappers to familiar algorithms. 


These techniques include correlation analysis, regularization techniques, and Recursive Feature 
Elimination (RFE). When we're done, you'll be able to confidently use these techniques to support 
your selection of feature sets, potentially saving yourself a significant amount of work every time you 
work with a new dataset! 


Correlation 


We'll begin our discussion of feature selection by looking for a simple source of major problems for 
regression models: multicollinearity. Multicollinearity is the fancy name for moderate or high degrees 
of correlation between features in a dataset. An obvious example is how pizza slice count is collinear
with pizza price. 


There are two types of multicollinearity: structural and data-based. Structural multicollinearity occurs
when the creation of new features, such as feature f1 from feature f, creates multiple features that may
be highly correlated with one another. Data-based multicollinearity tends to occur when two
variables are affected by the same causative factor.
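Both kinds can be illustrated with a few lines of NumPy. In this sketch, squaring is used as a hypothetical way of deriving f1 from f, and the shared cause behind variables a and b is simulated noise; both are assumptions for the sake of the example.

```python
import numpy as np

# Structural: f1 is derived directly from f.
f = np.linspace(1, 10, 100)
f1 = f ** 2
structural_corr = np.corrcoef(f, f1)[0, 1]

# Data-based: two measured variables driven by the same underlying cause.
rng = np.random.RandomState(0)
cause = rng.randn(500)
a = 2.0 * cause + 0.5 * rng.randn(500)
b = -1.0 * cause + 0.5 * rng.randn(500)
data_corr = np.corrcoef(a, b)[0, 1]

print(structural_corr, data_corr)
```

In both cases the pair of features is strongly correlated (positively in the first case, negatively in the second), even though neither relationship is exact.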


Both kinds of multicollinearity can cause some unfortunate effects. In particular, our models' 
performance tends to become affected by which feature combinations are used; when collinear 
features are used, the performance of our model will tend to degrade. 


In either case, our approach is simple: we can test for multicollinearity and remove underperforming 
features. Naturally, underperforming features are ones that add very little to model performance. They 
might be underperforming because they replicate information available in other features, or they may 
simply not provide data that is meaningful to the problem at hand. There are multiple ways to test for
weak features, as many feature selection techniques will sift out multicollinear feature combinations
and recommend their removal if they're underperformant.


In addition, there is a specific multicollinearity test that's worth considering; namely, inspecting the
eigenvalues of our data's correlation matrix. Eigenvectors and eigenvalues are fundamental concepts
in matrix theory with many prominent applications. More details are given at the end of this
chapter. For now, suffice it to say that eigenvalues in the correlation matrix generated by our dataset 
provide us with a quantified measure of multicollinearity. Consider a set of eigenvalues as indicative 
of how much "new information content" our features bring to the dataset; a low eigenvalue suggests 
that the data may be correlated with other features. For an example of this at work, consider the 
following code, which creates a feature set and then adds collinearity to features 0, 2, and 4: 


import numpy as np


x = np.random.randn(100, 5)
noise = np.random.randn(100)
# Make feature 4 a noisy linear combination of features 0 and 2
x[:, 4] = 2 * x[:, 0] + 3 * x[:, 2] + .5 * noise


When we generate the correlation matrix and compute eigenvalues, we find the following: 


corr = np.corrcoef(x, rowvar=0)
w, v = np.linalg.eig(corr)


print('eigenvalues of features in the dataset x')
print(w)


eigenvalues of features in the dataset x 
[ 0.00716428 1.94474029 1.30385565 0.74699492 0.99724486] 


Clearly, our 0th eigenvalue is suspiciously small! We can then inspect the eigenvector that
corresponds to it by indexing into v:


print('components of eigenvector 0')
print(v[:, 0])


components of eigenvector 0
[-0.35603659 -0.00853105 -0.62463305  0.00959048  0.69460716]


From the small values in positions one and three, we can tell that features 2 and 4 are
highly multicollinear with feature 0. We ought to remove two of these three features before 
proceeding! 
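This inspection can be automated. The sketch below flags every near-zero eigenvalue and reports which features load heavily on its eigenvector. The two tolerance values are arbitrary assumptions that would need tuning for real data, and np.linalg.eigh is used in place of eig because a correlation matrix is symmetric.

```python
import numpy as np


def collinear_groups(x, eig_tol=0.05, loading_tol=0.1):
    # For each near-zero eigenvalue of the correlation matrix, list the
    # features with a non-negligible component in its eigenvector.
    corr = np.corrcoef(x, rowvar=False)
    w, v = np.linalg.eigh(corr)  # eigenvectors are the columns of v
    groups = []
    for value, vector in zip(w, v.T):
        if value < eig_tol:
            members = np.where(np.abs(vector) > loading_tol)[0]
            groups.append(sorted(int(i) for i in members))
    return groups


rng = np.random.RandomState(0)
x = rng.randn(100, 5)
x[:, 4] = 2 * x[:, 0] + 3 * x[:, 2] + .5 * rng.randn(100)

groups = collinear_groups(x)
print(groups)
```

Each returned group is a set of features that are close to linearly dependent, which makes it a natural starting point when deciding what to remove.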


LASSO 


Regularized methods are among the most helpful feature selection techniques as they provide sparse 
solutions: ones where weaker features return zero, leaving only a subset of features with real 
coefficient values. 


The two most used regularization models are L1 and L2 regularization, referred to as LASSO and
ridge regression respectively in linear regression contexts.


Regularized methods function by adding a penalty to the loss function. Instead of minimizing a loss
function $E(X, Y)$, the penalty leads to $E(X, Y) + \alpha \|w\|$. The hyperparameter $\alpha$ relates to the amount of
regularization (enabling us to tune the strength of our regularization and thus the proportion of the
original feature set that is selected).


In LASSO regularization, the specific penalty function used is $\alpha \sum_{i=1}^{n} |w_i|$. Each non-zero coefficient 
adds to the size of the penalty term, forcing weaker features to return coefficients of 0. Selecting an 
appropriate penalty term can be achieved using scikit-learn's parameter optimization support for 
hyperparameters. In this case, we'll be using estimator.get_params() to identify the hyperparameters 
over which to perform a grid search. For more information on how grid searches operate, see the 
Further reading section at the end of this chapter. 
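

As a rough illustration of that workflow, the sketch below lists a Lasso estimator's tunable parameters with get_params() and then searches over alpha using GridSearchCV. The synthetic dataset from make_regression and the candidate alpha values are assumptions chosen for the example, not values taken from this chapter.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (assumption): 10 features, 3 of them informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso()

# get_params() shows which hyperparameters can be placed in a search grid.
print(sorted(lasso.get_params().keys()))

# Illustrative grid of penalty strengths, scored with 5-fold cross-validation.
alphas = [0.01, 0.1, 0.3, 1.0, 3.0]
grid = GridSearchCV(lasso, param_grid={"alpha": alphas}, cv=5)
grid.fit(X, y)

best_alpha = grid.best_params_["alpha"]
n_selected = int((grid.best_estimator_.coef_ != 0).sum())
print("best alpha:", best_alpha)
print("features with non-zero coefficients:", n_selected)
```

Larger alpha values generally leave fewer non-zero coefficients, so the grid effectively trades predictive fit against sparsity.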


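In scikit-learn, logistic regression is provided with an L1 penalty for classification. Meanwhile, the 
LASSO module is provided for linear regression. For now, let's begin by applying LASSO to an 
example dataset. In this case, we'll use the Boston housing dataset: 


from sklearn.linear_model import Lasso 
from sklearn.preprocessing import StandardScaler 
# load_boston was removed in scikit-learn 1.2; this listing needs an earlier release 
from sklearn.datasets import load_boston 


def pretty_print_linear(coefs, names, sort=False): 
    # Helper not shown in the text; an assumed implementation that formats 
    # the coefficients as a linear expression, largest magnitude first 
    pairs = zip(coefs, names) 
    if sort: 
        pairs = sorted(pairs, key=lambda p: -abs(p[0])) 
    return " + ".join("%s * %s" % (round(c, 3), n) for c, n in pairs) 


boston = load_boston() 
scaler = StandardScaler() 
X = scaler.fit_transform(boston["data"]) 
Y = boston["target"] 
names = boston["feature_names"] 

lasso = Lasso(alpha=.3) 
lasso.fit(X, Y) 

print("Lasso model: ", pretty_print_linear(lasso.coef_, names, sort=True)) 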


Lasso model: -3.707 * LSTAT + 2.992 * RM + -1.757 * PTRATIO + -1.081 * DIS + 
-0.7 * NOX + 0.631 * B + 0.54 * CHAS + -0.236 * CRIM + 0.081 * ZN + -0.0 * INDUS 
+ -0.0 * AGE + 0.0 * RAD + -0.0 * TAX 


Several of the features in the original set returned a coefficient of 0.0. Increasing alpha 
makes the solution increasingly sparse. For instance, we see the following results when alpha = 
0.4: 


Lasso model: -3.707 * LSTAT + 2.992 * RM + -1.757 * PTRATIO + -1.081 * DIS + 
-0.7 * NOX + 0.631 * B + 0.54 * CHAS + -0.236 * CRIM + 0.081 * ZN + -0.0 * INDUS 
+ -0.0 * AGE + 0.0 * RAD + -0.0 * TAX 


We can immediately see the value of L1 regularization as a feature selection technique. However, it is 
important to note that L1 regularized regression is unstable. Coefficients can vary significantly, even 
with small data changes, when features in the data are correlated. 


This problem is effectively addressed with L2 regularization, or ridge regression, which produces 
feature coefficients with different properties and applications. L2 regularization adds an additional 
penalty, the L2 norm penalty, to the loss function. This penalty takes the form $\alpha \sum_{i=1}^{n} w_i^2$. A sharp-eyed reader will 
notice that, unlike the L1 penalty ($\alpha \sum_{i=1}^{n} |w_i|$), the L2 penalty uses squared coefficients. This causes 
the coefficient values to be spread out more evenly and has the added effect that correlated features 
tend to receive similar coefficient values. This significantly improves stability as the coefficients no 
longer fluctuate on small data changes. 


However, L2 regularization isn't as directly useful for feature selection as L1. Rather, as interesting 
features (with predictive power) tend to have non-zero coefficients, L2 is more useful as an 
exploratory tool allowing inference about the quality of features in the classification. It has the added 
merit of being more stable and reliable than L1 regularization. 
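

To make the contrast concrete, here is a small sketch that fits both models to two nearly identical features. The synthetic data, the seed, and the alpha values are assumptions for illustration; the exact coefficients will differ on other data.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)  # fixed seed (assumption)
x1 = rng.randn(200)
x2 = x1 + 0.01 * rng.randn(200)   # x2 is a near-duplicate of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.randn(200)

# Ridge (L2) tends to share the weight between correlated features.
ridge = Ridge(alpha=10.0).fit(X, y)

# Lasso (L1) is free to load most of the weight onto either feature.
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge coefficients:", ridge.coef_)
print("lasso coefficients:", lasso.coef_)
```

With ridge, the two coefficients come out almost equal. With LASSO, how the weight is divided between the two features is close to arbitrary, which is the instability described above.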


Recursive Feature Elimination 


RFE is a greedy, iterative process that functions as a wrapper over another model, such as an SVM 
(SVM-RFE), which it repeatedly runs over different subsets of the input data. 


As with LASSO and ridge regression, our goal is to find the best-performing feature subset. As the 
name suggests, on each iteration a feature is set aside allowing the process to be repeated with the 
rest of the feature set until all features in the dataset have been eliminated. The ordering with which 
features are eliminated becomes their rank. After multiple iterations with incrementally smaller 
subsets, each feature is accurately scored and relevant subsets can be selected for use. 


To get a better understanding of how this works, let's look at a simple example. We'll use the (by now 
familiar) digits dataset to understand how this approach works in practice: 


print(__doc__) 


from sklearn.svm import SVC 
from sklearn.datasets import load_digits 
from sklearn.feature_selection import RFE 
import matplotlib.pyplot as plt 


digits = load_digits() 
X = digits.images.reshape((len(digits.images), -1)) 
y = digits.target 


We'll use an SVM as our base estimator via the SVC class for Support Vector Classification 
(SVC). We then apply the RFE wrapper over this model. RFE takes several arguments, with the first 
being a reference to the estimator of choice. The second argument is n_features_to_select, which 
is fairly self-explanatory. In cases where the feature set contains many interrelated features whose 
subsets possess multivariate distributions that are highly effective classification features, it's possible 
to opt for feature combinations of two or more. 


Stepping enables the removal of multiple features on each iteration. When given a value between 0.0 
and 1.0, each step enables the removal of a percentage of the feature set, corresponding to the 
proportion given in the step argument: 


# Base estimator; a linear kernel exposes the coefficients that RFE ranks 
svc = SVC(kernel="linear", C=1) 
rfe = RFE(estimator=svc, n_features_to_select=1, step=1) 
rfe.fit(X, y) 
ranking = rfe.ranking_.reshape(digits.images[0].shape) 


plt.matshow(ranking) 
plt.colorbar() 
plt.title("Ranking of pixels with RFE") 
plt.show() 


Given that we're familiar with the digits dataset, we know that each instance is an 8 x 8 image of a 
handwritten digit, as shown in the following image. Each image is located in the center of the 8 x 8 
grid: 


[Figure: an example 8 x 8 image from the digits dataset] 


When we apply RFE over the digits dataset, we can see that it broadly captures this information in 
applying a ranking: 


[Figure: Ranking of pixels with RFE] 





The first pixels to be cut were in and around the (typically empty) vertical edges of the image. Next, 
the algorithm began culling normally whitespace areas around the vertical edges or near the top of the 
image. The pixels that were retained longest were those that enabled the most differentiation between 
the different characters: pixels that would be present for some numbers and not for others. 


This example gives us great visual confirmation that RFE works. What it doesn't give us is evidence 
for how consistently the technique works. The stability of RFE is dependent on the stability of the 
base model and, in some cases, ridge regression will provide a more stable solution. (For more 
information on which cases and the conditions involved, consult the Further reading section at the 
end of this chapter.) 


Genetic models 


Earlier in this chapter, we discussed the existence of algorithms that enable feature selection with 
very large parameter sets. Some of the most prominent techniques of this type are genetic algorithms, 
which emulate natural selection to generate increasingly effective models. 


A genetic solution for feature selection works roughly as follows: 


• An initial set of variables (predictors is the term typically used in this context) is combined 
into multiple subsets (candidates) and a performance measure is calculated for each candidate 
• The predictors from candidates with the best performance are randomly recombined into a new 
iteration (generation) of models 
• During this recombination step, for each subset there is the probability of a mutation, whereby a 
predictor may be added or removed 


This algorithm typically iterates for multiple generations. The appropriate iteration amount is 
dependent on the complexity of the dataset and the model required. As with gradient descent 
techniques, the typical relationship between the performance and iteration count is present for genetic 
algorithms, where performance improvement declines nonlinearly as the count of iterations increases, 
eventually hitting a minimum before the overfitting risk increases. 


To find an effective iteration count, we can perform testing using training data; by running the model 
for a large number of iterations and plotting the Root Mean Squared Error (RMSE), we're able to 
find an appropriate amount of iterations given our input data and model configuration. 


Let's talk in a little more detail about what happens within each generation. Specifically, let's talk 
about how candidates are created, how performance is scored, and how recombination is performed. 


The candidates are initially configured to use a random sample of the available predictors. There is 
no hard and fast rule concerning how many predictors to use in the first generation; it depends on how 
many features are available, but it's common to see first generation candidates using 50% to 80% of 
the available features (with a smaller percentage used in cases with more features). 


The fitness measure can be difficult to define, but a common practice is to use two forms of 
cross-validation. Internal cross-validation (testing each model solely in the context of its own parameters 
without comparing models) is typically used to track performance at a given iteration; the fitness 
measures from internal cross-validation are used to select models to recombine in the next generation. 
External cross-validation (testing against a dataset that was not used in validation at any iteration) is 
also needed in order to confirm that the search process produced a model that has not overfitted to the 
internal training data. 


Recombination is controlled by three key parameters: mutation probability, cross-over probability, 
and elitism. The latter is an optional parameter that one may use to reserve n-many of the 
top-performing models from the current generation; by doing so, one may preserve particularly 
effective candidates from being lost entirely during recombination. This can be done while also using 
those candidates in mutated variants and/or using them as parents to next-generation candidates. 


The mutation probability defines the chance of a next-generation model being randomly readjusted 
(via some predictors, typically one, being added or removed). Mutation tends to help the genetic 
algorithm maintain a broad coverage of the candidate variables, reducing the risk of falling into a 
parameter-local solution. 


Cross-over probability defines the likelihood that a pair of candidates will be selected for 
recombination into a next-generation model. There are several cross-over algorithms: parts of each 
parent's feature set might be spliced (for example, first half/second half) into the child or a random 
selection of each parent's features might be used. Common features to both parents might also be used 
by default. Random sampling from the set of both parents' unique predictors is a common default 
approach. 


These are the main parts of a general genetic algorithm, which can be used as a wrapper to existing 
models (logistic regression, SVM, and others). The technique described here can be varied in many 
different ways and is related to feature selection techniques used slightly differently across multiple 
quantitative fields. Let's take the theory that we've covered thus far and start applying it to a practical 
example. 
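

Before moving on, the generation loop described above can be sketched in a few lines. This is a deliberately simplified illustration, not a production implementation: the synthetic dataset, population size, generation count, mutation probability, and the choice of logistic regression as the wrapped model are all assumptions, every child is produced by cross-over (so no separate cross-over probability is modelled), and external cross-validation is omitted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=200, n_features=12, n_informative=4,
                           n_redundant=0, random_state=0)
n_features = X.shape[1]

def fitness(mask):
    """Internal cross-validated accuracy for one candidate feature subset."""
    if not mask.any():
        return 0.0
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, mask], y, cv=3).mean()

def crossover(a, b):
    """Take each feature flag from one parent or the other at random."""
    pick = rng.rand(n_features) < 0.5
    return np.where(pick, a, b)

def mutate(mask, p_mutation=0.1):
    """With probability p_mutation, add or remove a single predictor."""
    child = mask.copy()
    if rng.rand() < p_mutation:
        i = rng.randint(n_features)
        child[i] = not child[i]
    return child

# First generation: each candidate uses roughly 60% of the predictors.
population = [rng.rand(n_features) < 0.6 for _ in range(10)]
n_elite = 2
best_per_generation = []

for generation in range(5):
    scores = np.array([fitness(m) for m in population])
    order = np.argsort(scores)[::-1]          # best candidates first
    best_per_generation.append(scores[order[0]])
    parents = [population[i] for i in order[:5]]
    # Elitism: carry the top candidates over unchanged.
    next_gen = [population[i].copy() for i in order[:n_elite]]
    while len(next_gen) < len(population):
        i, j = rng.choice(len(parents), size=2, replace=False)
        next_gen.append(mutate(crossover(parents[i], parents[j])))
    population = next_gen

best_mask = population[0]  # the elite candidate from the last scored generation
print("best score per generation:", np.round(best_per_generation, 3))
print("selected features:", np.where(best_mask)[0])
```

Because the elite candidates are copied forward unchanged and the fitness function is deterministic, the best score can never fall from one generation to the next.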


Feature engineering in practice 


Depending on the modeling technique that you're using, some of this work may be more valuable than 
other parts. Deep learning algorithms tend to perform better on less-engineered data than shallower 
models and it might be that less work is needed to improve results. 


The key to understanding what is needed is to iterate quickly through the whole process from dataset 
acquisition to modeling. On a first pass with a clear target for model accuracy, find the acceptable 
minimum amount of processing and perform that. Learn whatever you can about the results and make a 
plan for the next iteration. 


To show how this looks in practice, we'll work with an unfamiliar, high-dimensional dataset, using an 
iterative process to generate increasingly effective modeling. 


I was recently living in Vancouver. While it has many positive qualities, one of the worst things about 
living in the city was the somewhat unpredictable commute. Whether I was traveling by car or taking 
Translink's Skytrain system (a monorail-meets-rollercoaster high-speed line), I found myself subject 
to hard-to-predict delays and congestion issues. 


In the spirit of putting our new feature engineering skillset into practice, let's take a look at whether 
we can improve this experience by taking the following steps: 


• Writing code to harvest data from multiple APIs, including text and climate streams 
• Using our feature engineering techniques to derive variables from this initial data 
• Testing our feature set by generating commute delay risk scores 


Unusually, in this example, we'll focus less on building and scoring a highly performant model. 
Instead, our focus is on creating a self-sufficient solution that you can adjust and apply for your own 
local area. While it suits the goals of the current chapter to take this approach, there are two 
additional and important motivations. 


Firstly, there are some challenges around sharing and making use of Twitter data. Part of the terms of 
use of Twitter's API is an obligation on the developer to ensure that any adjustments to the state of a 
timeline or dataset (including, for instance, the deletion of a tweet) are reproduced in datasets that are 
extracted from Twitter and publicly shared. This makes the inclusion of real Twitter data in this 
chapter's GitHub repository impractical. Ultimately, this makes it difficult to provide reproducible 
results from any downstream model based on streamed data because users will need to build their 
own stream and accumulate data points and because variations in context (such as seasonal 
variations) are likely to affect model performance. 


The second element here is simple enough: not everybody lives in Vancouver! In order to generate 
something of value to an end user, we should think in terms of an adjustable, general solution rather 
than a geographically-specific one. 


The code presented in the next section is therefore intended to be something to build from and 
develop. It offers potential as the basis of a successful commercial app or simply a useful, 
data-driven life hack. With this in mind, review this chapter's content (and leverage the code in the 
associated code directory) with an eye to finding and creating new applications that fit your own 
situation, locally available data, and personal needs. 


Acquiring data via RESTful APIs 


In order to begin, we're going to need to collect some data! We're going to need to look for rich, 
timestamped data that is captured at sufficient frequency (preferably at least one record per commute 
period) to enable model training. 


A natural place to begin is the Twitter API, which allows us to harvest recent tweet data. We can 
put this API to two uses. 


Firstly, we can harvest tweets from official transit authorities (specifically, bus and train companies). 
These companies provide transit service information on delays and service disruptions that, helpfully 
for us, takes a consistent format conducive to tagging efforts. 


Secondly, we can tap into commuter sentiment by listening for tweets from the geographical area of 
interest, using a customized dictionary to listen for terms related to cases of disruption or the causes 
thereof. 


In addition to mining the Twitter API for data to support our model, we can leverage other APIs to 
extract a wealth of information. One particularly valuable source of data is the Bing Traffic API. 
This API can be easily called to provide traffic congestion or disruption incidents across a user- 
specified geographical area. 


In addition, we can leverage weather data from the Yahoo Weather API. This API provides the 
current weather for a given location, taking zip codes or location input. It provides a wealth of local 
climate information including, but not limited to, temperature, wind speed, humidity, atmospheric 
pressure, and visibility. Additionally, it provides a text string description of current conditions as 
well as forecast information. 


While there are other data sources that we can consider tying into our analysis, we'll begin with this 
data and see how we do. 


Testing the performance of our model 


In order to meaningfully assess our commute disruption prediction attempt, we should try to define 
test criteria and an appropriate performance score. 


What we're attempting to do is identify the risk of commute disruption on the current day, each day. 
Preferably, we'd like to know the commute risk with sufficient advance notice that we can take action 
to mitigate it (for example, by leaving home earlier). 


In order to do this, we're going to need three things: 


• An understanding of what our model is going to output 
• A measure we can use to quantify model performance 
• Some target data we can use to score model performance according to our measure 


We can have an interesting discussion about why this matters. It can be argued, effectively, that some 
models are informational in purpose. Our commute risk score, it might be said, is useful insofar as it 
generates information that we didn't previously have. 


The reality of the situation, however, is that there is inevitably going to be a performance criterion. 
In this case, it might simply be my satisfaction with the results output by my model, but it's important 
to be aware that there is always some performance criterion at play. Quantifying performance is 
therefore valuable, even in contexts where a model appears to be informational (or even better, 
unsupervised). This makes it prudent to resist the temptation to waive performance testing; at least 
this way, you have a quantified performance measure to iteratively improve on. 


A sensible starting point is to assert that our model is intended to output a numerical score in a 0-1 
range for outbound (home to work) commutes on a given day. We have a few options with regard to 
how we present this score; perhaps the most obvious option would be to apply a log rescaling to the 
data. There are good reasons to log-scale and in this situation it might not be a bad idea. (It's not 
unlikely that the distribution of commute delay time obeys a power law.) For now, we won't reshape 
this set of scores. Instead, we'll wait to review the output of our model. 


In terms of delivering practical guidance, a 0-1 score isn't necessarily very helpful. We might find 
ourselves wanting to use a bucketed system (such as high risk, mid risk, or low risk) with bucket 
boundaries at specific points in the 0-1 range. In short, we would transition to treating the 
problem as a multiclass classification problem with categorical output (class labels), rather than as a 
regression problem with a continuous output. 


This might improve model performance, because it increases the margin of free error to the full 
breadth of the relevant bucket, which makes for a very generous performance measure. 
Equally though, it probably isn't a great idea to introduce this change on the first iteration. Until we've 
reviewed the distribution of real commute delays, we won't know where to draw the boundaries 
between classes! 
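

For reference, once sensible boundaries are known, bucketing a 0-1 score takes only a couple of lines. The sketch below is illustrative: the example scores and the 0.33 and 0.66 boundaries are hypothetical placeholders, not values derived from real commute data.

```python
import numpy as np

# Hypothetical risk scores in the 0-1 range.
scores = np.array([0.05, 0.2, 0.45, 0.7, 0.95])

# Placeholder bucket boundaries; real ones should come from the observed
# distribution of commute delays.
boundaries = [0.33, 0.66]
labels = np.array(["low risk", "mid risk", "high risk"])

# np.digitize returns the index of the bucket that each score falls into.
buckets = labels[np.digitize(scores, boundaries)]
print(list(zip(scores, buckets)))
```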


Next, we need to consider how we measure the performance of our model. The selection of an 
appropriate scoring measure generally depends on the characteristics of the problem. We're presented 
with a lot of options around classifier performance scoring. (For more information around 
performance measures for machine learning algorithms, see the Further reading section at the end of 
this chapter.) 


One way of deciding which performance measure is suitable for the task at hand is to consider the 
confusion matrix. A confusion matrix is a table of contingencies; in the context of statistical modeling, 
it typically describes the label prediction versus actual labels. It is common to output a confusion 
matrix (particularly for multiclass problems with more classes) for a trained model as it can yield 
valuable information about classification failures by failure type and class. 


In this context, the reference to a confusion matrix is more illustrative. We can consider the following 
simplified matrix to assess whether there is any contingency that we don't care about: 


                     Actual Result 
                     TRUE              FALSE 
Prediction  TRUE     True Positive     False Positive 
            FALSE    False Negative    True Negative 





In this case, we care about all four contingency types. False negatives will cause us to be caught in 
unexpected delays, while false positives will cause us to leave for our commute earlier than 
necessary. This implies that we want a performance measure that values both high sensitivity (true 
positive rate) and high specificity (true negative rate). The ideal measure, given this, is area under 
the ROC curve (AUC). 
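

As a quick illustration of how these quantities relate, the following sketch computes sensitivity, specificity, and AUC with scikit-learn. The labels and scores are made-up values for demonstration, and the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Made-up example: 1 = commute delay, 0 = no delay.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.7, 0.3, 0.6, 0.2, 0.1, 0.4, 0.2])

# Turn scores into hard predictions with an assumed 0.5 threshold.
y_pred = (y_score >= 0.5).astype(int)

# For binary labels, ravel() yields the four contingency counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate

# AUC is computed from the raw scores, so it does not depend on the threshold.
auc = roc_auc_score(y_true, y_score)
print(sensitivity, specificity, auc)
```

Sensitivity and specificity describe a single threshold, whereas AUC summarizes the trade-off between them across all thresholds.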


The second challenge is how to measure this score; we need some target against which to predict. 
Thankfully, this is quite easy to obtain. I do, after all, have a daily commute to do! I simply began 
self-recording my commute time using a stopwatch, a consistent start time, and a consistent route. 


It's important to recognize the limitations of this approach. As a data source, I am subject to my own 
internal trends. I am, for instance, somewhat sluggish before my morning coffee. Similarly, my own 
consistent commute route may possess local trends that other routes do not. It would be far better to 
collect commute data from a number of people and a number of routes. 


However, in some ways, I was happy with the use of this target data, not least because I am 
attempting to classify disruption to my own commute route and would not want natural variance in my 
commute time to be misinterpreted through training, say, against targets set by some other group of 
commuters or routes. In addition, the anticipated slight natural variability from day to day 
should be disregarded by a functional model. 


It's rather hard to tell what's good enough in terms of model performance. More precisely, it's not easy 
to know when this model is outperforming my own expectations. Unfortunately, not only do I not have 
any very reliable data with regard to the accuracy of my own commute delay predictions, it also seems 
unlikely that one person's predictions are generalizable to other commutes in other locations. It seems 
ill-advised to train a model to exceed a fairly subjective target. 


Let's instead attempt to outperform a fairly simple threshold—a model that naively suggests that every 
single day will not contain commute delays. This target has the rather pleasing property of mirroring 
our actual behavior (in that we tend to get up each day and act as though there will not be transit 
disruption). 


Of the 85 target data cases, 14 commute delays were observed. Based on this target data and the 
score measure we created, our target to beat is therefore 0.5. 
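

The 0.5 figure can be checked directly. The sketch below rebuilds a target vector with the same class counts (14 delays out of 85 commutes; the ordering of the cases is an arbitrary assumption) and scores a model that always predicts no delay.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# 85 recorded commutes, 14 of which were delayed (order is arbitrary here).
y_true = np.array([1] * 14 + [0] * 71)

# The naive model gives every day the same "no delay" score.
baseline_scores = np.zeros(len(y_true))

baseline_auc = roc_auc_score(y_true, baseline_scores)
baseline_accuracy = accuracy_score(y_true, baseline_scores.astype(int))
print("baseline AUC:", baseline_auc)
print("baseline accuracy:", round(baseline_accuracy, 3))
```

The naive model is right on 71 of 85 days, so plain accuracy would flatter it; AUC correctly rates a constant prediction as no better than chance.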


Given that we're focusing this example analysis on the city of Vancouver, we have an opportunity to 
tap into a second Twitter data source. Specifically, we can use service announcements from 
Vancouver's public transit authority, Translink. 


Translink Twitter 


As noted, this data is already well-structured and conducive both to text mining and subsequent 
analysis; by processing this data using the techniques we reviewed in the last two chapters, we can 
clean the text and then encode it into useful features. 


We're going to apply the Twitter API to harvest Translink's tweets over an extended period. The 
Twitter API is a pretty friendly piece of kit that is easy enough to work with from Python. (For 
extended guidance around how to work with the Twitter API, see the Further reading section at the 
end of this chapter!) In this case, we want to extract the date and body text from the tweet. The body 
text contains almost everything we need to know, including the following: 


• The nature of the tweet (delay or non-delay) 
• The station affected 
• Some information as to the nature of the delay 


One element that adds a little complexity is that the same Translink account tweets service disruption 
information for Skytrain lines and bus routes. Fortunately, the account is generally very uniform in the 
terms that it uses to describe service issues for each service type and subject. In particular, the 
Twitter account uses specific hashtags (#RiderAlert for bus route information, #SkyTrain for train- 
related information, and #TransitAlert for general alerts across both services, such as statutory 
holidays) to differentiate the subjects of service disruption. 


Similarly, we can expect a delay to always be described using the word delay, a detour by the term 
detour, and a diversion, using the word diversion. This means that we can filter out unwanted tweets 
using specific key terms. Nice job, Translink! 
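

A filter of this kind can be sketched as follows. The hashtags and key terms come from the description above, but the function name and the sample tweets are hypothetical and are not taken from the chapter's dataset.

```python
# Hashtags mapped to the service type they announce.
SERVICE_TAGS = {
    "#RiderAlert": "bus",
    "#SkyTrain": "train",
    "#TransitAlert": "general",
}
DISRUPTION_TERMS = ("delay", "detour", "diversion")

def classify_tweet(text):
    """Return (service type or None, list of disruption terms found)."""
    lowered = text.lower()
    service = next(
        (name for tag, name in SERVICE_TAGS.items() if tag.lower() in lowered),
        None,
    )
    terms = [term for term in DISRUPTION_TERMS if term in lowered]
    return service, terms

# Hypothetical tweets, written for illustration only.
sample_tweets = [
    "#SkyTrain Expo Line delay due to a switch issue at 22nd Street",
    "#RiderAlert 99 B-Line detour in effect due to construction",
    "Good morning! All services are running normally.",
]

results = [classify_tweet(t) for t in sample_tweets]
# Keep only tweets that mention at least one disruption term.
wanted = [t for t, (service, terms) in zip(sample_tweets, results) if terms]
print(results)
```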


Note 


The data used in this chapter is provided within the GitHub solution accompanying this chapter in the 
translink_tweet_data.json file. The scraper script is also provided within the chapter code; in 
order to leverage it, you'll need to set up a developer account with Twitter. This is easy to achieve; 
the process is documented here and you can sign up here. 


Once we've obtained our tweet data, we know what to do next: we need to clean and regularize the 
body text! As per Chapter 6, Text Feature Engineering, let's run BeautifulSoup and NLTK over the 
input data: 


import nltk 
from bs4 import BeautifulSoup 
from nltk.tag import NgramTagger 

# train is assumed to hold the harvested tweet data 
tweets = BeautifulSoup(train["TranslinkTweets.text"], "html.parser") 


tweettext = tweets.get_text() 
# Brown corpus news category (named 'a' in older NLTK releases) 
brown_a = nltk.corpus.brown.tagged_sents(categories='news') 


# Chain 1-, 2-, and 3-gram taggers, each backing off to the previous one 
tagger = None 
for n in range(1, 4): 
    tagger = NgramTagger(n, brown_a, backoff=tagger) 


# tag() expects a list of tokens rather than a raw string 
taggedtweettext = tagger.tag(tweettext.split()) 


We probably will not need to be as intensive in our cleaning as we were with the troll dataset in the 
previous chapter. Translink's tweets are highly formulaic and do not include non-ascii characters or 
emoticons, so the specific "deep cleaning" regex script that we needed to use in Chapter 6, Text 
Feature Engineering, won't be needed here. 


This gives us a dataset with lower-case, regularized, and dictionary-checked terms. We are ready to 
start thinking seriously about what features we ought to build out of this data. 


We know that the base method of detecting a service disruption issue within our data is the use of a 
delay term in a tweet. Delays happen in the following ways: 


• At a given location 
• At a given time 
• For a given reason 
• For a given duration 


Each of the first three factors is consistently tracked within Translink tweets, but there are some data 
quality concerns that are worth recognizing. 


Location is given in terms of an affected street or station at 22nd Street. This isn't a perfect 
description for our purpose as we're unlikely to be able to turn a street name and route start/end 
points into a general affected area without doing substantial additional work (as no convenient 
reference exists that allows us to draw a bounding box based on this information). 


Time is imperfectly given by the tweet datetime. While we don't have visibility on whether tweets are 
made within a consistent time from service disruption, it's likely that Translink has targets around 
service notification. For now, it's sensible to proceed under the assumption that the tweet times are 
likely to be sufficiently accurate. 


The exception is likely to be for long-running issues or problems that change severity (delays that are 
expected to be minor but which become significant). In these cases, tweets may be delayed until the 
Translink team recognizes that the issue has become tweet-worthy. The other possible cause of data 
quality issues is inconsistency in Translink's internal communications; it's possible that engineering or 
platform teams don't always inform the customer service notifications team at the same speed. 


We're going to have to take a certain amount on faith though, as there isn't a huge amount we can do to 
measure these delay effects without a dataset of real-time, accurate Translink service delays. (If we 


Reasons for Skytrain service delays are consistently described by Translink and can fall into one of 
the following categories: 


Rail 
Train 
Switch 
Control 
Unknown 
Intrusion 
Medical 
Police 
Power 


Each category is described within the tweet body using the specific proper term given in the 
preceding list. Obviously, some of these categories (Police, Power, Medical) are less likely to be 
relevant as they wouldn't tell us anything useful about road conditions. The rate of train, track, and 
switch failure may be correlated with detour likelihood; this suggests that we may want to keep those 
cases for classification purposes. 


Meanwhile, bus route service delays contain a similar set of codes, many of which are very relevant 
to our purposes. These codes are as follows: 


Motor Vehicle Accident (MVA) 
Construction 

Fire 

Watermain 

Traffic 


Encoding these incident types is likely to prove useful! In particular, it's possible that certain service 
delay types are more impactful than others, increasing the risk of a longer service delay. We'll want to 
encode service delay types and use them as parameters in our subsequent modeling. 


To do this, let's apply a variant of one-hot encoding, which does the following: 


• It creates a conditional variable for each of the service risk types and sets all values to zero 
• It checks tweet content for each of the service risk type terms 
• It sets the relevant conditional variable to 1 for each tweet that contains a specific risk term 


This effectively performs one-hot encoding without taking the bothersome intermediary step of 
creating the factorial variable that we'd normally be processing: 


from sklearn import preprocessing 


# The original call also passed categorical_features='all' and n_values='auto'; 
# both arguments were removed in scikit-learn 0.22, so they are omitted here 
enc = preprocessing.OneHotEncoder(dtype=float, handle_unknown='error') 


# The encoder must be fitted before it can transform; tweets is assumed to be 
# a DataFrame with a delaytype column 
delayencode = enc.fit_transform(tweets[["delaytype"]]).toarray() 
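

The three steps listed above can also be written out directly with pandas. This is a minimal sketch under stated assumptions: the DataFrame layout, the lower-cased risk terms, and the sample tweet text are all hypothetical.

```python
import pandas as pd

# Risk terms taken from the bus delay codes, lower-cased for matching.
RISK_TERMS = ["mva", "construction", "fire", "watermain", "traffic"]

# Hypothetical tweet text, for illustration only.
tweets = pd.DataFrame({"text": [
    "#RiderAlert 25 detour due to MVA on King Edward",
    "#RiderAlert 9 delay due to construction and heavy traffic",
    "#SkyTrain service is back to normal",
]})

lowered = tweets["text"].str.lower()
for term in RISK_TERMS:
    # Step 1: create the conditional variable, initialised to zero.
    tweets[term] = 0
    # Steps 2 and 3: check each tweet for the term and set the flag to 1.
    tweets.loc[lowered.str.contains(term, regex=False), term] = 1

encoded = tweets[RISK_TERMS].values.tolist()
print(tweets[RISK_TERMS])
```

Unlike a strict one-hot encoding, a single tweet can set more than one flag here, which suits tweets that mention several causes.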


Beyond what we have available to use as a feature on a per-incident basis, we can definitely look at 
the relationship between service disruption risk and disruption frequency. If we see two disruptions 
in a week, is a third more likely or less likely? 


While these questions are interesting and potentially fruitful, it's usually more prudent to work up a 
limited feature set and simple model on a first pass than to overengineer a sprawling feature set. As 
such, we'll run with the initial incidence rate features and see where we end up. 


Consumer comments 


A major cultural development in 2010 was the widespread use of public online domains for self- 
expression. One of the happier products of this is the availability of a wide array of self-reported 
information on any number of subjects, provided we know how to tap into this. 


Commute disruptions are frequently occurring events that inspire a personal response, which means 
that they tend to be quite broadly reported on social media. If we write an appropriate dictionary for 
key-term search, we can begin using Twitter particularly as a source of timestamped information on 
traffic and transit issues around the city. 


In order to collect this data, we'll make use of a dictionary-based search approach. We're not 
interested in the majority of tweets from the period in question (and as we're using the RESTful API, 
there are return limits to consider). Instead, we're interested in identifying tweet data containing key 
terms related to congestion or delay. 


Unfortunately, tweets harvested from a broad range of users tend not to conform to consistent styles 
that aid analysis. We're going to have to apply some of the techniques we developed in the preceding 
chapter to break down this data into a more easily analyzed format. 


In addition to using a dictionary-based search, we could do some work to narrow the search area 
down. The most authoritative way to achieve this is to use a bounding box of coordinates as an 
argument to the Twitter API, such that any related query exclusively returns results gathered from 
within this area. 


As always, on our first pass, we'll keep things simple. In this case, we'll count up the number of 
traffic disruption tweets in the current period. There is some additional work that we could benefit 
from doing with this data on subsequent iterations. Just as the Translink data contained clearly- 
defined delay cause categories, we could try to use specialized dictionaries to isolate delay types 
based on key terms (for example, a dictionary of construction-related terms and synonyms). 


We could also look at defining a more nuanced quantification of disruption tweet rate than a simple 
count of recent tweets. We could, for instance, look at creating a weighted count feature that increases 
the impact of multiple simultaneous tweets (potentially indicative of severe disruption) via a nonlinear 
weighting. 
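
To make the idea of a nonlinear weighting concrete, here is one possible sketch. It assumes tweets have already been bucketed into short time windows, and it raises each window's count to a power greater than 1 so that bursts contribute more than the same number of scattered tweets. The exponent of 1.5 and both function names are illustrative assumptions rather than values used in the chapter.

```python
def simple_count(tweets_per_window):
    """Plain count of disruption tweets across all windows."""
    return sum(tweets_per_window)


def weighted_count(tweets_per_window, exponent=1.5):
    """Sum of per-window counts raised to a power > 1, so bursts weigh more."""
    return sum(count ** exponent for count in tweets_per_window)


spread_out = [1, 1, 1, 1]  # four tweets, one per window
burst = [0, 4, 0, 0]       # four tweets arriving in the same window

print("simple:", simple_count(spread_out), simple_count(burst))
print("weighted:", weighted_count(spread_out), weighted_count(burst))
```

Both patterns have the same simple count, but the burst receives the larger weighted count, which is the behavior we would want if simultaneous tweets signal a more severe disruption.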


The Bing Traffic API 

The next API we're going to tap into is the Bing Traffic API. This API has the advantage of being 
easily accessed; it's freely available (whereas some competitor APIs sit behind paywalls), returns 
data, and provides a good level of detail. Among other things, the API returns an incident location 
code, a general description of the incident, together with congestion information, an incident type 
code, and start/end timestamps. 


Helpfully, the incident type codes provided by this API describe a broad set of incident types, as 
follows: 


1. Accident. 
2. Congestion. 
3. DisabledVehicle. 
4. MassTransit. 
5. Miscellaneous. 
6. OtherNews. 
7. PlannedEvent. 
8. RoadHazard. 
9. Construction. 
10. Alert. 
11. Weather. 

Additionally, a severity code is provided with the severity values translated as follows: 


1. LowImpact. 
2. Minor. 
3. Moderate. 
4. Serious. 


One downside, however, is that this API doesn't receive consistent information between regions. 
Querying in France, for instance, returns codes for several incident types (I observed 1, 3, 5, and 8 
for a town in northern France over a period of one month) but doesn't seem to show every code. In 
other locations, even less data is available. Sadly, Vancouver tends to show data for codes 9 or 5 
exclusively, but even the miscellaneous-coded incidents appear to be construction-related: 


Closed between Victoria Dr and Commercial Dr - Closed. Construction work. 5 


This is a somewhat bothersome limitation. Unfortunately, it's not something that we can easily fix; 
Bing's API is simply not sourcing all of the data that we want! Unless we pay for a more complete 
dataset (or an API with fuller data capture is available in your area!), we're going to need to keep 
working with what we have. 


An example of querying this API is as follows: 


import urllib.request, urllib.error, urllib.parse
import json

# Bounding box for the area of interest (south, west, north, east)
latN = str(49.310911)
latS = str(49.201444)
lonW = str(-123.225544)
lonE = str(-122.905931)

# Replace the placeholder with your own API key
url = ('http://dev.virtualearth.net/REST/v1/Traffic/Incidents/'
       + latS + ',' + lonW + ',' + latN + ',' + lonE
       + '?key=' + 'GETYOUROWNKEYPLEASE')

response = urllib.request.urlopen(url).read()
data = json.loads(response.decode('utf8'))
resources = data['resourceSets'][0]['resources']

print('PRETTIFIED RESULTS')
print('--------------------------------------------------')
for resourceItem in resources:
    description = resourceItem['description']
    typeof = resourceItem['type']
    severity = resourceItem['severity']
    start = resourceItem['start']
    end = resourceItem['end']
    print('description:', description)
    print('type:', typeof)
    print('severity', severity)
    print('starttime:', start)
    print('endtime:', end)
    print('--------------------------------------------------')


This example yields the following data: 


description: Closed between Boundary Rd and Pierview Cres - Closed due to roadwork.
type: 9
severity 4
starttime: /Date(1458331200000)/
endtime: /Date(1466283600000)/

description: Closed between Commercial Dr and Victoria Dr - Closed due to roadwork.
type: 9
severity 4
starttime: /Date(1458327600000)/
endtime: /Date(1483218000000)/

description: Closed between Victoria Dr and Commercial Dr - Closed. Construction work.
type: 5
severity 4
starttime: /Date(1461780543000)/
endtime: /Date(1481875140000)/

description: At Thurlow St - Roadwork.
type: 9
severity 3
starttime: /Date(1461780537000)/
endtime: /Date(1504112400000)/


Even after recognizing the shortcomings of uneven code availability across different geographical 
areas, the data from this API should provide us with some value. Having a partial picture of traffic 
disruption incidents still gives us data for a reasonable period of dates. The ability to localize traffic 
incidents within an area of our own definition, and to return data relevant to the current date, is likely 
to help the performance of our model. 
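
Before the start and end values in the output above can be used as features, they need to be converted into real timestamps. The sketch below assumes that the number inside `/Date(...)/` is a count of milliseconds since the Unix epoch in UTC, which is what the values in the sample output suggest; the helper name `parse_api_date` is hypothetical.

```python
import re
from datetime import datetime, timezone

# Matches strings such as "/Date(1458331200000)/", tolerating stray spaces.
_DATE_PATTERN = re.compile(r"/Date\s*\((\d+)\)\s*/")


def parse_api_date(value):
    """Convert a '/Date(ms)/' string into a timezone-aware UTC datetime."""
    match = _DATE_PATTERN.fullmatch(value.strip())
    if match is None:
        raise ValueError("unrecognized timestamp: %r" % value)
    millis = int(match.group(1))
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


start = parse_api_date("/Date(1458331200000)/")
end = parse_api_date("/Date(1466283600000)/")

# Incident duration in days, a simple derived feature.
duration_days = (end - start).total_seconds() / 86400
print(start.isoformat(), end.isoformat(), round(duration_days, 2))
```

With the timestamps parsed, it becomes straightforward to count the incidents that are active on a given date or to derive incident durations.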


Deriving and selecting variables using feature engineering techniques 


On our first pass over the input data, we repeatedly made the choice to keep our initial feature set 
small. Though we saw lots of opportunities in the data, we prioritized viewing an initial result above 
following up on those opportunities. 


It is likely, however, that our first dataset won't help us solve the problem very effectively or hit our 
targets. In this event, we'll need to iterate over our feature set, both by creating new features and 
winnowing our feature set to reduce down to the valuable outputs of that feature creation process. 


One helpful example involves one-hot encoding and RFE. In this chapter, we'll use one-hot to turn 
weather data and tweet dictionaries into tensors of m*n size. Having produced m-many new columns 
of data, we'll want to reduce the liability of our model to be misled by some of these new features 
(for instance, in cases where multiple features reinforce the same signal or where misleading but 
commonly-used terms are not cleaned out by the data cleaning processes we described in Chapter 6, 
Text Feature Engineering). This can be done very effectively by RFE, the technique for feature 
selection that we discussed earlier in this chapter. 


In general, it can be helpful to work using a methodology that applies the techniques seen in the last 
two chapters using an expand-contract process. First, use techniques that can generate potentially 
valuable new features, such as transformations and encodings, to expand the feature set. Then, use 
techniques that can identify the most performant subset of those features to remove the features that do 
not perform well. Throughout this process, test different target feature counts to identify the best 
available feature set at different numbers of features. 
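
As a rough illustration of this expand-contract loop, the sketch below expands a small synthetic dataset with polynomial features and then contracts it with RFE at several target feature counts. The data, the choice of `PolynomialFeatures` as the expansion step, and the candidate counts (2, 4, 8) are all assumptions made for the example; they are not the chapter's commute data.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: the target depends on one raw feature and one interaction.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=200)

# Expand: generate squares and pairwise interactions of the raw features.
expander = PolynomialFeatures(degree=2, include_bias=False)
X_expanded = expander.fit_transform(X)

# Contract: keep the best n features with RFE and score each candidate count.
scores = {}
for n_features in (2, 4, 8):
    selector = RFE(LinearRegression(), n_features_to_select=n_features)
    X_selected = selector.fit_transform(X_expanded, y)
    scores[n_features] = cross_val_score(
        LinearRegression(), X_selected, y, cv=5).mean()

for n_features, score in scores.items():
    print(n_features, "features -> mean R^2 %.3f" % score)
```

For brevity, the selection step here runs on the full dataset before cross-validation, which slightly flatters the scores; wrapping the selector and model in a scikit-learn `Pipeline` would avoid that leak.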


Some data scientists interpret how this is done differently from others. Some will build all of their 
features using repeated iterations over the feature creation techniques we've discussed, then reduce 
that feature set—the motivation being that this workflow minimizes the risk of losing data. Others will 
perform the full process iteratively. How you choose to do this is entirely up to you! 


On our initial pass over the input data, then, we have a feature set that looks as follows: 

{
  'DisruptionInformation': {
    'Date': '15-05-2015',
    'TranslinkTwitter': [{
      'Service': '0',
      'DisruptionIncidentCount': '4'
    }, {
      'Service': '1',
      'DisruptionIncidentCount': '0'
    }],
    'BingTrafficAPI': {
      'NewIncidentCount': '1',
      'SevereIncidentCount': '1',
      'IncidentCount': '3'
    },
    'ConsumerTwitter': {
      'DisruptionTweetCount': '4'
    }
  }
}

It's unlikely that this dataset is going to perform well. All the same, let's run it through a basic initial 
algorithm and get a general idea as to how near our target we are; this way, we can learn quickly with 
minimal overhead! 


In the interest of expedience, let's begin by running a first pass using a very simple regression 
algorithm. The simpler the technique, the faster we can run it (and often, the more transparent it is to 
us what went wrong and why). For this reason (and because we're dealing with a regression problem 
with a continuous output rather than a classification problem), on a first pass we'll work with a 
simple linear regression model: 


import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

# Hold out the last 20 records for testing
tweets_X_train = tweets_X[:-20]
tweets_X_test = tweets_X[-20:]
tweets_y_train = tweets.target[:-20]
tweets_y_test = tweets.target[-20:]

regr = linear_model.LinearRegression()
regr.fit(tweets_X_train, tweets_y_train)

print('Coefficients: \n', regr.coef_)
print("Residual sum of squares: %.2f" %
      np.mean((regr.predict(tweets_X_test) - tweets_y_test) ** 2))
print('Variance score: %.2f' % regr.score(tweets_X_test, tweets_y_test))

plt.scatter(tweets_X_test, tweets_y_test, color='black')
plt.plot(tweets_X_test, regr.predict(tweets_X_test), color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()


At this point, our AUC is pretty lousy; we're looking at a model with an AUC of 0.495. We're actually 
doing worse than our target! Let's print out a confusion matrix to see what this model's doing wrong: 


[Figure: confusion matrix of actual result against prediction]

According to this matrix, it isn't doing anything very well. In fact, it's claiming that almost all of the 
records show no incidents, to the extent of missing 90% of real disruptions! 
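
Because a linear regression produces continuous scores, a confusion matrix like the one above requires those scores to be thresholded first. The following sketch shows one way the counts could be tallied; the labels, the scores, and the 0.5 cutoff are invented for illustration and are not the chapter's results.

```python
def confusion_counts(actual, predicted):
    """Tally true/false positives and negatives for binary labels."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if a and p:
            counts["tp"] += 1
        elif p:
            counts["fp"] += 1
        elif a:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Hypothetical ground truth (1 = disrupted commute) and regression scores.
actual = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
scores = [0.62, 0.10, 0.05, 0.55, 0.30, 0.12, 0.08, 0.20, 0.02, 0.41]

# Assumed cutoff: treat any score of 0.5 or more as a predicted disruption.
predicted = [1 if s >= 0.5 else 0 for s in scores]

counts = confusion_counts(actual, predicted)
recall = counts["tp"] / (counts["tp"] + counts["fn"])
print(counts, "recall = %.2f" % recall)
```

Recall, the share of real disruptions that the model catches, is the quantity being described when we say the model misses most real disruptions.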


This actually isn't too bad at all, given the early stage that we're at with our model and our features, as 
well as the uncertain utility of some of our input data. At the same time, we should expect an 
incidence rate of 6% (as our training data suggests that incidents have been seen to occur roughly 
once every 16 commutes). We'd still be doing a little better by guessing that every day will involve a 
disrupted commute (if we ignore the penalty to our lifestyle entailed by leaving home early each day). 


Let's consider what changes we could make in a next pass. 


1. First off, we could stand to improve our input data further. We identified a number of new 
features that we could create from existing sources using a range of transformation techniques. 

2. Secondly, we could look at extending our dataset using additional information. In particular, a 
weather dataset describing both temperature and humidity may help us improve our model. 

3. Finally, we could upgrade our algorithm to something with a little more grunt, random forests or 
SVM being obvious examples. There are good reasons not to do this just yet. The main reason is 
that we can continue to learn a lot from linear regression; we can compare against earlier results 
to understand how much value our changes are adding, while retaining a fast iteration loop and 
simple scoring methods. Once we begin to get minimal returns on our feature preparation, we 
should consider upgrading our model. 


For now, we'll continue to upgrade our dataset. We have a number of options here. We can encode 
location into both traffic incident data from the Bing API's "description" field and into Translink's 
tweets. In the case of Translink, this is likely to be more usefully done for bus routes than Skytrain 
routes (given that we restricted the scope of this analysis to focus solely on traffic commutes). 


We can achieve this goal in one of two ways: 


• Using a corpus of street names/locations, we can parse the input data and build a one-hot matrix 
• We can simply run one-hot encoding over the entire body of tweets and entire set of API data 


Interestingly, if we intend to use dimensionality reduction techniques after performing one-hot 
encoding, we can encode the entire body of both pieces of text information without any significant 
problems. If features relating to the other words used in tweets and text are not relevant, they'll simply 
be scrubbed out during RFE. 


This is a slightly laissez-faire approach, but there is a subtle advantage. Namely, if there is some 
other potentially useful content to either data source that we've so far overlooked as a potential 
feature, this process will yield the added benefit of creating features based on that information. 


Let's encode locations in the same way we encoded delay types: 
from sklearn import preprocessing 


enc = preprocessing.OneHotEncoder(categorical_features='all', dtype='float',
                                  handle_unknown='error', n_values='auto', sparse=True)


tweets.delayencode = enc.transform(tweets.location).toarray() 


Additionally, we should follow up on our intention to create recent count variables from Translink 
and Bing Maps incident logging. The code for this aggregation is available in the GitHub repository 
accompanying this chapter! 


Rerunning our model with this updated data produced results with a very slight improvement; the 
predicted variance score rose to 0.56. While not dramatic, this is definitely a step in the right 
direction. 


Next, let's follow up on our second option—adding a new data source that provides weather data. 


The weather API 


We've previously grabbed data that will help us tell whether commute disruption is happening— 
reactive data sources that identify existing delays. We're going to change things up a little now, by 
trying to find data that relates to the causes of delays and congestion. Roadworks and construction 
information definitely falls into this category (along with some of the other Bing Traffic API codes). 


One factor that is often (anecdotally!) tied to increased commute time is bad weather. Sometimes this 
is pretty obvious; heavy frost or high winds have a clear impact on commute time. In many other 
cases, though, it's not clear what the strength and nature of the relationship between climatic factors 
and disruption likelihood is for a given commute. 


By extracting pertinent weather data from a source with sufficient granularity and geo coverage, we 
can hopefully use strong weather signals to help improve our correct prediction of disruption. 


For our purposes, we'll use the Yahoo Weather API, which provides a range of temperature, 
atmospheric, pressure-related, and other climate data, both current and forecasted. We can query the 
Yahoo Weather API without needing a key or login process, as follows: 


import urllib.request, urllib.parse, json

baseurl = "https://query.yahooapis.com/v1/public/yql?"

# YQL query for current conditions at the given WOEID (location identifier)
yql_query = "select item.condition from weather.forecast where woeid=9807"
yql_url = baseurl + urllib.parse.urlencode({'q': yql_query}) + "&format=json"
result = urllib.request.urlopen(yql_url).read()

data = json.loads(result.decode('utf8'))

print(data['query']['results'])


To get an understanding for what the API can provide, replace item.condition (in what is 
fundamentally an embedded SQL query) with *. This query outputs a lot of information, but digging 
through it reveals valuable information, including the current conditions: 


{
  'channel': {
    'item': {
      'condition': {
        'date': 'Thu, 14 May 2015 03:00 AM PDT', 'text': 'Cloudy', 'code': '26', 'temp': '46'
      }
    }
  }
}


7-day forecasts containing the following information: 


{
  'item': {
    'forecast': {
      'code': '39', 'text': 'Scattered Showers', 'high': '60', 'low': '44',
      'date': '16 May 2015', 'day': 'Sat'
    }
  }
}


And other current weather information: 


'astronomy': {
  'sunset': '8:30 pm', 'sunrise': '5:30 am'
},
'wind': {
  'direction': '270', 'speed': '4', 'chill': '46'
}


For the purpose of building a training dataset, we extracted data on a daily basis via an automated 
script that ran from May 2015 to January 2016. The forecasts may not be terribly useful to us as it's 
likely that our model will rerun over current data on a daily basis rather than being dependent on 
forecasts. However, we will definitely make use of the wind.direction, wind.speed, and 
wind.chill variables, as well as the condition.temperature and condition.text variables. 

In terms of how to further process this information, one option jumps to mind. One-hot encoding of 
weather tags would enable us to use weather condition information as categorical variables, just as 
we did in the preceding chapter. This seems like a necessary step to take. This significantly inflates 
our feature set, leaving us with the following data: 


{
  'DisruptionInformation': {
    'Date': '15-05-2015',
    'TranslinkTwitter': [{
      'Service': '0',
      'DisruptionIncidentCount': '4'
    }, {
      'Service': '1',
      'DisruptionIncidentCount': '0'
    }],
    'BingTrafficAPI': {
      'NewIncidentCount': '1',
      'SevereIncidentCount': '1',
      'IncidentCount': '3'
    },
    'ConsumerTwitter': {
      'DisruptionTweetCount': '4'
    },
    'YahooWeather': {
      'temp': '45',
      'tornado': '0',
      'tropical storm': '0',
      'hurricane': '0',
      'severe thunderstorms': '0',
      'thunderstorms': '0',
      'mixed rain and snow': '0',
      'mixed rain and sleet': '0',
      'mixed snow and sleet': '0',
      'freezing drizzle': '0',
      'drizzle': '0',
      'freezing rain': '0',
      'showers': '0',
      'snow flurries': '0',
      'light snow showers': '0',
      'blowing snow': '0',
      'snow': '0',
      'hail': '0',
      'sleet': '0',
      'dust': '0',
      'foggy': '0',
      'haze': '0',
      'smoky': '0',
      'blustery': '0',
      'windy': '0',
      'cold': '0',
      'cloudy': '1',
      'mostly cloudy (night)': '0',
      'partly cloudy (night)': '0',
      'partly cloudy (day)': '0',
      'clear (night)': '0',
      'sunny': '0',
      'fair (night)': '0',
      'fair (day)': '0',
      'mixed rain and hail': '0',
      'hot': '0',
      'isolated thunderstorms': '0',
      'scattered thunderstorms': '0',
      'scattered showers': '0',
      'heavy snow': '0',
      'scattered snow showers': '0',
      'partly cloudy': '0',
      'thundershowers': '0',
      'snow showers': '0',
      'isolated thundershowers': '0',
      'not available': '0'
    }
  }
}


It's very likely that a lot of time could be valuably sunk into further enriching the weather data 
provided by the Yahoo Weather API. For the first pass, as always, we'll remain focused on building a 
model that takes the features that we described previously. 


Note 


It's definitely worth considering how we would do further work with this data. In this case, it's 
important to distinguish between cross-column data transformations and cross-row transformations. 


A cross-column transformation is one where variables from different features in the same input case 
were transformed based on one another. For instance, we might take the start date and end date of a 
case and use it to calculate the duration. Interestingly, the majority of the techniques that we've studied 
in this book won't gain a lot from many such transformations. Most machine learning techniques 
capable of drawing nonlinear decision boundaries tend to encode relationships between variables in 
their modeling of a dataset. Deep learning techniques often take this capability a step further. This is 
part of the reason that some feature engineering techniques (particularly basic transformations) add 
less value for deep learning applications. 


Meanwhile, a cross-row transformation is typically an aggregation. The central tendency of the last 
n-many duration values, for instance, is a feature that can be derived by an operation over multiple 
rows. Naturally, some features can be derived by a combination of column-wise and row-wise 
operations. The interesting thing about cross-row transformations is that it's usually quite unlikely that 
a model will train to recognize them, meaning that they tend to continue to add value in very particular 
contexts. 


The reason that this information is relevant, of course, is that recent weather is a context in which 
features derived from cross-row operations might add new information to our model. Change in 
barometric pressure or temperature over the last few hours, for instance, might be a more useful variable 
than the current pressure or temperature. (Particularly when our model is intended to predict 
commutes that take place later in the same day!) 
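
The distinction between the two kinds of transformation can be sketched in a few lines. The example below uses four invented hourly records; the field names, the values, and the helper functions `trailing_mean` and `change_over` are hypothetical and serve only to show a cross-column feature (duration) next to two cross-row features (a trailing mean and a lagged change).

```python
from datetime import datetime

# Hypothetical hourly records, each with an incident window and a pressure reading.
observations = [
    {"start": datetime(2015, 5, 14, 6, 0), "end": datetime(2015, 5, 14, 6, 45), "pressure": 1012.0},
    {"start": datetime(2015, 5, 14, 7, 0), "end": datetime(2015, 5, 14, 7, 30), "pressure": 1010.5},
    {"start": datetime(2015, 5, 14, 8, 0), "end": datetime(2015, 5, 14, 9, 0), "pressure": 1008.0},
    {"start": datetime(2015, 5, 14, 9, 0), "end": datetime(2015, 5, 14, 9, 15), "pressure": 1007.0},
]

# Cross-column: combine two fields of the same row into a duration in minutes.
for row in observations:
    row["duration_min"] = (row["end"] - row["start"]).total_seconds() / 60


def trailing_mean(values, n):
    """Mean of the latest n values at each position (None until n values exist)."""
    out = []
    for i in range(len(values)):
        if i + 1 < n:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - n:i + 1]) / n)
    return out


def change_over(values, lag):
    """Difference between each value and the value `lag` rows earlier."""
    return [None if i < lag else values[i] - values[i - lag]
            for i in range(len(values))]


# Cross-row: aggregate over neighbouring rows.
durations = [row["duration_min"] for row in observations]
pressures = [row["pressure"] for row in observations]
mean_duration_3 = trailing_mean(durations, 3)
pressure_change_3 = change_over(pressures, 3)
print(mean_duration_3)
print(pressure_change_3)
```

The lagged pressure change is the kind of feature that a model is unlikely to discover on its own when each row is presented independently.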


The next step is to rerun our model. This time, our AUC is a little higher; we're scoring 0.534. 
Looking at our confusion matrix, we're also seeing improvements: 


[Figure: confusion matrix of actual result against prediction]

If the issues are linked to weather factors, continuing to pull weather data is a good idea; setting this 
solution up to run over an extended period will gradually gather longitudinal inputs from each source, 
gradually giving us much more reliable predictions. 


At this point, we're only a short distance away from our MVP target. We can continue to extend our 
input dataset, but the smart solution is to find another way to approach the problem. There are two 
actions that we can meaningfully take. 


Note 


Being human, data scientists tend to think in terms of simplifying assumptions. One of these that crops 
up quite frequently is basically an application of the Pareto principle to cost/benefit analysis 
decisions. Fundamentally, the Pareto principle states that for many events, roughly 80% of the value 
or effect comes from roughly 20% of the input effort, or cause, obeying what's referred to as a Pareto 
distribution. This concept is very popular in software engineering contexts among others, as it can 
guide efficiency improvements. 

[Figure: The Pareto principle of time versus result, contrasting the "trivial many" (80% of time expended) with the "vital few"]

To apply this theory to the current case, we know that we could spend more time finessing our feature 
engineering. There are techniques that we haven't applied and other features that we could create. 
However, at the same time, we know that there are entire areas that we haven't touched: external data 
searches and model changes, particularly, which we could quickly try. It makes sense to explore these 
cheap but potentially impactful options on our next pass before digging into additional dataset 
preparation. 


During our exploratory analysis, we noticed that some of our variables are quite sparse. It wasn't 
immediately clear how helpful they all were (particularly for stations where fewer incidents of a 
given type occurred). 


Let's test out our variable set using some of the techniques that we worked with earlier in the chapter. 
In particular, let's apply Lasso to the problem of reducing our feature set to a performant subset: 


from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X = scaler.fit_transform(DisruptionInformation["data"])
Y = DisruptionInformation["target"]
names = DisruptionInformation["feature_names"]

lasso = Lasso(alpha=.3)
lasso.fit(X, Y)

# pretty_print_linear is a coefficient-formatting helper, assumed to be defined earlier
print("Lasso model: ", pretty_print_linear(lasso.coef_, names, sort=True))


This output is immediately valuable. It's obvious that many of the weather features (either through not 
showing up sufficiently often or not telling us anything useful when they do) are adding nothing to our 
model and should be removed. In addition, we're not getting a lot of value from our traffic aggregates. 
While these can remain in for the moment (in the hope that gathering more data will improve their 
usefulness), for our next pass we'll rerun our model without the poorly-scoring features that our use of 
LASSO has revealed. 
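
One way to act on the Lasso output is to drop every feature whose coefficient has been shrunk to zero. The sketch below does this on synthetic data in which only two of five columns carry signal; the feature names and data are invented for illustration, and only the alpha value of 0.3 is taken from the listing above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix (names are hypothetical).
rng = np.random.RandomState(0)
names = ["tweet_count", "incident_count", "temp", "foggy", "haze"]
X_raw = rng.normal(size=(300, 5))

# Only the first two columns drive the synthetic target.
y = 2.0 * X_raw[:, 0] + 1.0 * X_raw[:, 1] + rng.normal(scale=0.1, size=300)

X = StandardScaler().fit_transform(X_raw)
lasso = Lasso(alpha=0.3).fit(X, y)

# Keep only the features whose coefficients survived the L1 penalty.
keep = np.abs(lasso.coef_) > 1e-8
kept_names = [name for name, flag in zip(names, keep) if flag]
X_reduced = X[:, keep]

print("kept:", kept_names)
print("dropped:", [name for name, flag in zip(names, keep) if not flag])
```

scikit-learn's `SelectFromModel` wraps the same idea and may be more convenient inside a pipeline.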


There is one fairly cheap additional change that we ought to make: we should upgrade our model 
to one that can fit nonlinear relationships and can therefore approximate a much wider range of 
functions. This is worth doing because, as we observed, some of our features showed a range of 
skewed distributions indicative of a nonlinear underlying trend. Let's apply a random forest to this dataset: 


import numpy as np
import matplotlib.pyplot as pl
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rf = RandomForestRegressor(n_jobs=3, verbose=2, n_estimators=20)

# Fit on the feature matrix (data) against the observed values (targets)
rf.fit(DisruptionInformation_train.data, DisruptionInformation_train.targets)

predictions = rf.predict(DisruptionInformation.data)
r2 = r2_score(DisruptionInformation.targets, predictions)
mse = np.mean((DisruptionInformation.targets - predictions) ** 2)

pl.scatter(DisruptionInformation.targets, predictions)
pl.plot(np.arange(8, 15), np.arange(8, 15), label="r^2=" + str(r2), c="r")
pl.legend(loc="lower right")
pl.title("RandomForest Regression with scikit-learn")
pl.show()


Let's return again to our confusion matrix: 








[Figure: confusion matrix of actual result against prediction]





At this point, we're doing fairly well. A simple upgrade to our model has yielded significant 
improvements, with our model correctly identifying almost 40% of commute delay incidents (enough 
to start to be useful to us!), while misclassifying a small number of cases. 


Frustratingly, this model would still be getting us out of bed early incorrectly more times than it 
would correctly. The gold standard, of course, would be if it were predicting more commute delays 
than it was causing false (early) starts! We could reasonably hope to achieve this target if we continue 
to gather feature data over a sustained period; the main weakness of this model is that it has very few 
cases to sample from, given the rarity of commute disruption events. 


We have, however, succeeded in gathering and marshaling a range of data from different sources in 
order to create a model from freely-available data that yields a recognizable, real-world benefit 
(reducing the number of late arrivals at work by 40%). This is definitely an achievement to be happy 
with! 

Further reading 


My suggested go-to introduction to feature selection is Ando Saabas's four-part exploration of a broad 
range of feature selection techniques. It's full of Python code snippets and informed commentary. Get 
started at http://blog.datadive.net/selecting-good-features-part-1-univariate-selection/. 


For a discussion on feature selection and engineering that ranges across materials in chapters 6 and 7, 
consider Alexandre Bouchard-Côté's slides at 
http://people.eecs.berkeley.edu/~jordan/courses/294-fall09/lectures/feature/slides.pdf. Also consider 
reviewing Jeff Howbert's slides at 
http://courses.washington.edu/css490/2012.Winter/lecture_slides/05a_feature_creation_selection.pdf. 


There is a shortage of thorough discussion of feature creation, with a lot of available material 
discussing either dimensionality reduction techniques or very specific feature creation as required in 
specific domains. One way to get a more general understanding of the range of possible 
transformations is to read code documentation. A decent place to build on your existing knowledge is 
Spark ML's feature-transformation algorithm documentation at 
https://spark.apache.org/docs/1.5.1/ml-features.html#feature-transformers, which describes a broad 
set of possible transformations on numerical and text features. Remember, though, that feature creation 
is often problem-specific, domain-specific, and a highly creative process. Once you've learned a 
range of technical options, the trick is in figuring out how to apply these techniques to the problem at 
hand! 


For readers with an interest in hyperparameter optimization, I recommend that you read Alice Zheng's 
posts on Turi's blog as a great place to start: 
http://blog.turi.com/how-to-evaluate-machine-learning-models-part-4-hyperparameter-tuning. 


I also find the scikit-learn documentation to be a useful reference for grid search specifically: 
http://scikit-learn.org/stable/modules/grid_search.html. 

Summary 


In this chapter, you learned and applied a set of techniques that enable us to effectively build and 
finesse datasets for machine learning, starting from very little initial data. These powerful techniques 
enable a data scientist to turn seemingly shallow datasets into opportunities. We demonstrated this 
power using a set of customer service tweets to create a travel disruption predictor. 


In order to take that solution into production, though, we'd need to add some functionality. Removing 
some locations in the penultimate step was a questionable decision; if this solution is intended to 
identify journey disruption risk, then removing locations seems like a non-starter! This is particularly 
true given that we do not have year-round data and so cannot identify the effect of seasonal or 
longitudinal trends (like extended maintenance works or a scheduled station closure). We were a little 
hasty in removing these elements and a better solution would be to retain them for a longer period. 


Following on from these concerns, we should recognize the need to start building some dynamism 
into our solution. When spring rolls around and our dataset starts to contain new climate conditions, it 
is entirely likely that our model will fail to adapt as effectively. In the next chapter, we will be 
looking at building more sophisticated model ensembles and discuss methods of building robustness 
into your model solutions. 

Chapter 8. Ensemble Methods 


As we progressed through the earlier chapters of this book, you learned how to apply a number of 
new techniques. We developed our use of several advanced machine learning algorithms and 
acquired a broad range of companion techniques used to enhance your use of learning techniques via 
more effective feature selection and preparation. This chapter seeks to enhance your existing 
technique set using ensemble methods: techniques that bind multiple different models together to 
solve a real-world problem. 


Ensemble techniques have become a fundamental part of the data scientist's toolset. The use of 
ensembles has become common practice in competitive machine learning contexts, and ensembles are 
now considered an indispensable tool in many contexts. The techniques that we'll develop in this 
chapter give our models an edge in performance, while increasing their robustness to underlying data 
change. 


We'll examine a series of ensembling options, discussing both the code and application of these 
techniques. We'll color this explanation with guidance and reference to real-world applications, 
including the models created by successful Kagglers. 


The development of any of the models that we reviewed in this title allows us to solve a wide range 
of data problems, but applying our models to production contexts raises an additional set of 
problems. Our solutions are still vulnerable to changes in the underlying observations. Whether this is 
expressed in a different population of individuals, in temporal variations (for example, seasonal 
changes in the phenomenon being captured) or by other changes to the underlying conditions, the end 
result is often the same: the models that worked well in the conditions they were trained against are 
frequently unable to generalize and continue to perform well as time passes. 


The final section of this chapter describes methodologies to transfer the techniques from this book to 
operational environments and the kinds of additional monitoring and support you should consider if 
your intended applications have to be resilient to change. 
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Introducing ensembles 


"This is how you win ML competitions: you take other peoples' work and ensemble them together." 


--Vitaly Kuznetsov NIPS2014 


In the context of machine learning, an ensemble is a set of models that is used to solve a shared 
problem. An ensemble is made up of two components: a set of models and a set of decision rules that 
govern how the results of those models are combined into a single output. 
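The two components named above can be sketched in a few lines. This is a minimal illustration, not code from the book: the three "models" are hypothetical threshold rules and `majority_vote` is an assumed name for a simple voting decision rule.

```python
from collections import Counter


def majority_vote(predictions):
    """Decision rule: return the label predicted by the most models."""
    return Counter(predictions).most_common(1)[0][0]


# Component 1: a set of models. These are hypothetical stand-ins,
# each one a simple threshold rule on a single number.
models = [
    lambda x: int(x > 2),
    lambda x: int(x > 4),
    lambda x: int(x > 6),
]


def ensemble_predict(x):
    """Component 2: combine every model's output into a single answer."""
    return majority_vote([model(x) for model in models])


print([ensemble_predict(x) for x in (1, 3, 5, 7)])
```

Swapping `majority_vote` for an average, a weighted sum, or a second model changes the type of ensemble without touching the models themselves.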


Ensembles offer a data scientist the ability to construct multiple solutions for a given problem and 
then combine these into a single final result that draws from the best elements of each input solution. 
This provides robustness against noise, which is reflected in more effective training against an initial 
dataset (leading to lower levels of overfitting and reductions in training error) and against data 
change of the kinds discussed in the preceding section. 


It is no exaggeration to say that ensembles are the most important recent development in machine 
learning. 


In addition, ensembles enable greater flexibility in how one solves for a given problem, in that they 
enable the data scientist to test different parts of a solution and resolve issues specific to subsets of 
the input data or parts of the models in use, without completely retuning the whole model. As we'll 
see, this can make life easier! 


Ensembles are typically considered as falling into one of several classes, based on the nature of the 
decision rules used. The key ensemble types are as follows: 


• Averaging methods: They develop models in parallel and then use averaging or voting 
  techniques to develop a combined estimator 

• Stacking (or Blending) methods: They use the weighted output of multiple classifiers as inputs 
  to a next-layer model 

• Boosting methods: They involve building models in sequence where each added model aims to 
  improve the score of the combined estimator 


Given the importance and utility of each of these classes of the ensemble method, we'll treat each one 
in turn: discussing theory, algorithm options, and real-world examples. 
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Understanding averaging ensembles 


Averaging ensembles have a long and rich history in the physical sciences and statistical modeling, 
seeing a common application in many contexts including molecular dynamics and audio signal 
processing. Such ensembles are typically seen as almost exactly replicated cases of a given system. 
The average (mean) values of and variance between cases in this system are key values for the system 
as a whole. 


In a machine learning context, an averaging ensemble is a collection of models that train on the same 
dataset, whose results are aggregated in a range of ways. Depending on implementation goals, an 
averaging ensemble can bring several benefits. 


Averaging ensembles can be used to reduce the variability of a model's performance. One common 
method involves creating multiple model configurations that take different parameter subsets as input. 
Techniques that take this approach are referred to collectively as bagging algorithms. 


Using bagging algorithms 


Different bagging implementations will operate differently but share the common property of taking 
random subsets of the feature space. There are four main types of the bagging approach. Pasting 
draws random subsets of the samples without replacement. When this is done with replacement, then 
the approach is simply called bagging. Pasting is typically computationally cheaper than bagging and 
can yield similar results in simpler applications. 


When samples are taken feature-wise, the method is known as random subspaces. Random subspace 
methods provide a slightly different capability; they essentially reduce the need for extensive, highly 
optimized feature selection. Where such activities typically lead to a single model with optimized 
input, random subspaces allow the use of multiple configurations in parallel, with a flattening of the 
variance of any one solution. 


Note 


While the use of an ensemble to reduce the variability in model performance may sound like a 
performance hit (the natural response might be "but why not just pick the single best performing model 
in the ensemble?"), there are big advantages to this approach. 


Firstly, as discussed, averaging improves the ability of your model set to adapt to unfamiliar noise 
(that is, it reduces overfitting). Secondly, an ensemble can be used to target different elements of the 
input dataset to model effectively. This is a common approach in competitive machine learning 
contexts, where a data scientist will iteratively adjust the ensemble based on the results of 
classification and particular types of failure cases. In some cases, this is an exhaustive process 
involving the inspection of model results (commonly as part of a normal, iterative model development 
process) but many data scientists prefer techniques or a solution that they will implement first. 


Random subspaces can be a very powerful approach, particularly if it's possible to use multiple 
subspace sizes and exhaustively check feature combinations. The cost of random subspace methods 
increases nonlinearly with the size of your dataset and, beyond a certain point, it will become costly 
to test every configuration of parameters for multiple subspace sizes. 


Finally, an ensemble's estimators may be created from subsets drawn from both samples and features, 
in a method known as random patches. On a like-for-like case, the performance of random patches is 
usually around the same level as that of random subspace techniques with significantly reduced 
memory consumption. 
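The four sampling schemes described above differ only in which axis is subsampled and whether rows are drawn with replacement. The sketch below makes that concrete using only the standard library. `draw_subset` is a hypothetical helper; its parameter names merely echo the `max_samples`, `max_features`, and `bootstrap` options of sklearn's BaggingClassifier, and it is not that class's implementation.

```python
import random


def draw_subset(n_samples, n_features, max_samples=1.0, max_features=1.0,
                bootstrap=False, rng=None):
    """Return (row_indices, column_indices) for one base estimator."""
    rng = rng or random.Random()
    n_rows = max(1, int(max_samples * n_samples))
    n_cols = max(1, int(max_features * n_features))
    if bootstrap:
        # Bagging: rows drawn with replacement, so duplicates may occur.
        rows = [rng.randrange(n_samples) for _ in range(n_rows)]
    else:
        # Pasting: rows drawn without replacement.
        rows = rng.sample(range(n_samples), n_rows)
    # Features are always drawn without replacement in this sketch.
    cols = sorted(rng.sample(range(n_features), n_cols))
    return rows, cols


rng = random.Random(0)
pasting = draw_subset(10, 4, max_samples=0.5, rng=rng)
bagging = draw_subset(10, 4, max_samples=0.5, bootstrap=True, rng=rng)
subspaces = draw_subset(10, 4, max_features=0.5, rng=rng)
patches = draw_subset(10, 4, max_samples=0.5, max_features=0.5, rng=rng)

for name, (rows, cols) in [("pasting", pasting), ("bagging", bagging),
                           ("random subspaces", subspaces),
                           ("random patches", patches)]:
    print(name, "rows:", rows, "features:", cols)
```

Each base estimator in the ensemble would be trained on its own draw, which is where the diversity between estimators comes from.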


As we've discussed the theory behind bagging ensembles, let's look at how we go about implementing 
one. The following code describes a random patches classifier implemented using sklearn's 
BaggingClassifier class: 


from sklearn.cross_validation import cross_val_score
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale

digits = load_digits()
data = scale(digits.data)
X = data
y = digits.target

# Random patches: each estimator sees half the cases and half the features
bagging = BaggingClassifier(KNeighborsClassifier(), max_samples=0.5,
                            max_features=0.5)
scores = cross_val_score(bagging, X, y)
mean = scores.mean()

print(scores)
print(mean)


As with many sklearn classifiers, the core code needed is very straightforward; the classifier is 
initialized and used to score the dataset. Cross-validation (via cross_val_score) adds no 
meaningful complexity. 


This bagging classifier used a K-Nearest Neighbors (KNN) classifier (KNeighborsClassifier) 
as a base, with feature-wise and case-wise sampling rates each set to 50%. This outputs very strong 
results against the digits dataset, correctly classifying a mean of 93% of cases after cross-validation: 


[ 0.94019934 0.92320534 0.9295302 ] 
0.930978293043 


Using random forests 


An alternative set of averaging ensemble techniques is referred to collectively as random forests. 
Perhaps the most successful ensemble technique used by competitive data scientists, random forests 
develop parallel sets of decision tree classifiers. By introducing two main sources of randomness to 
the classifier construction, the forest ends up containing diverse trees. The data that is used to build 
each tree is sampled with replacement from the training set, while the tree creation process no longer 
uses the best split from all features, instead choosing the best split from a random subset of the 
features. 


Random forests can be easily called using the RandomForestClassifier class in sklearn. For a 
simple example, consider the following: 


import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale

digits = load_digits()
data = scale(digits.data)

n_samples, n_features = data.shape
n_digits = len(np.unique(digits.target))
labels = digits.target

clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(data, labels)
# Note: this scores the forest on the data it was trained on
scores = clf.score(data, labels)

print(scores)


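The scores output by this ensemble, 0.999, are difficult to beat. Indeed, we haven't seen performance 
at this level from any of the individual models we employed in preceding chapters. Bear in mind, 
though, that this score is measured on the same data that the forest was trained on, so it overstates 
how the model will perform on unseen cases. 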


A variant of random forests, called extremely randomized trees (ExtraTrees), uses the same 
random subset of features method in selecting the best split at each branch in the tree. However, it 
also randomizes the discrimination threshold; where a decision tree normally chooses the most 
effective split between classes, ExtraTrees split at a random value. 
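The difference between the two split strategies can be shown on a single feature. The following is a simplified sketch under stated assumptions: split quality is measured by a plain misclassification count rather than the impurity measures a real tree uses, and the function names and toy data are hypothetical.

```python
import random


def misclassified(values, labels, threshold):
    """Errors made if each side of the split predicts its majority class."""
    errors = 0
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    for side in (left, right):
        if side:
            majority = max(set(side), key=side.count)
            errors += sum(1 for lab in side if lab != majority)
    return errors


def best_threshold(values, labels):
    """Decision-tree style: search the midpoints for the most effective split."""
    ordered = sorted(set(values))
    candidates = [(a + b) / 2.0 for a, b in zip(ordered, ordered[1:])]
    return min(candidates, key=lambda t: misclassified(values, labels, t))


def random_threshold(values, rng):
    """ExtraTrees style: draw the split point at random within the range."""
    return rng.uniform(min(values), max(values))


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = [0, 0, 0, 1, 1, 1]
rng = random.Random(0)

exact = best_threshold(values, labels)
drawn = random_threshold(values, rng)
print("searched split:", exact, "errors:", misclassified(values, labels, exact))
print("random split:", drawn, "errors:", misclassified(values, labels, drawn))
```

A single random split is usually worse than the searched one, but it is much cheaper to produce, and averaging many such trees is what recovers the accuracy.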


Due to the relatively efficient training of decision trees, a random forest algorithm can potentially 
support a large number of varied trees with the effectiveness of the classifier improving as the 
number of nodes increases. The randomness introduced provides a degree of robustness to noise or 
data change; like the bagging algorithms we reviewed earlier, however, this gain typically comes at 
the cost of a slight drop in performance. In the case of ExtraTrees, the robustness may increase further 
while the performance measure improves (typically a bias value reduces). 


The following code describes how ExtraTrees work in practice. As with our random subspace 
implementation, the code is very straightforward. In this case, we'll develop a set of models to 
compare how ExtraTrees shape up against tree and random forest approaches: 


from sklearn.cross_validation import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale

digits = load_digits()
data = scale(digits.data)
X = data
y = digits.target

# A single decision tree as the baseline
clf = DecisionTreeClassifier(max_depth=None, min_samples_split=1,
                             random_state=0)
scores = cross_val_score(clf, X, y)
print(scores)

# A random forest of 10 trees
clf = RandomForestClassifier(n_estimators=10, max_depth=None,
                             min_samples_split=1, random_state=0)
scores = cross_val_score(clf, X, y)
print(scores)

# ExtraTrees with the same configuration
clf = ExtraTreesClassifier(n_estimators=10, max_depth=None,
                           min_samples_split=1, random_state=0)
scores = cross_val_score(clf, X, y)
print(scores)


The scores, respectively, are as follows: 


[ 0.74252492 0.82136895 0.75671141] 
[ 0.88372093 0.9015025 0.8909396 ] 
[ 0.91694352 0.93489149 0.91778523] 


Given that we're working with entirely tree-based methods here, the score is simply the proportion of 
correctly-labeled cases. We can see here that there isn't much in it between the two forest methods, 
which both perform strongly with mean scores of around 0.9. In this example, ExtraTrees actually wins 
out (a mean score of roughly 0.92 against 0.89 for the random forest), while both techniques substantially 
outperform the basic decision tree, whose mean score sits at 0.77. 


One drawback when working with random forests (especially as the size of the forest increases) is 
that it can be hard to review the effectiveness of, or tune, a given implementation. While individual 
trees are extremely easy to work with, the sheer number of trees in a developed ensemble and the 
obfuscation created by random splitting can make it challenging to refine a random forest 
implementation. One option is to begin looking at the decision boundaries that individual models 
draw. By contrasting the models within one's ensemble, it becomes easier to identify where one 
model performs better at dividing classes than others. 


In this example, for instance, we can easily see how our models perform at a high level without 
digging into specific details: 


[Figure: Classifiers on feature subsets of the Iris dataset. Panels: DecisionTree, RandomForest, ExtraTrees, AdaBoost] 





While it can be challenging to understand beyond a simple level (using high-level plots and summary 
scores) how a random forest implementation is performing, the hardship is worthwhile. Random 
forests perform very strongly with only a minimal cost in additional computation. They are very often 
a good technique to throw at a problem during the early stages, while one is still determining an angle 
of attack, because their ability to yield strong results fast can provide a useful benchmark. Once you 
know how a random forest implementation performs, you can begin to optimize and extend your 
ensemble. 


To this end, we should continue exploring the different ensemble techniques so as to further build out 
our toolkit of ensembling options. 
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Applying boosting methods 


Another approach to ensemble creation is to build boosting models. These models are characterized 
by their use of multiple models in sequence to iteratively "boost" or improve the performance of the 
ensemble. 


Boosting models frequently use a series of weak learners, models that provide only marginal gain 
compared to random guessing. At each iteration, a new weak learner is trained on an adjusted dataset. 
Over multiple iterations, the ensemble is extended with one new tree (whichever tree optimized the 
ensemble performance score) at each iteration. 


Perhaps the most well-known boosting method is AdaBoost, which adjusts the dataset at each 
iteration by performing the following actions: 


• Selecting a decision stump (a shallow, often one-level decision tree, effectively the most 
  significant decision boundary for the dataset in question) 

• Increasing the weighting of cases that the decision stump labeled incorrectly, while reducing the 
  weighting of correctly labeled cases 


This iterative weight adjustment causes each new classifier in the ensemble to prioritize training the 
incorrectly labeled cases; the model adjusts by targeting highly-weighted data points. Eventually, the 
stumps are combined to form a final classifier. 
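The reweighting step can be written out directly. This sketch follows the textbook two-class (discrete) AdaBoost update with labels in {-1, +1}; library implementations such as sklearn's SAMME variant differ in detail, and `adaboost_round` and the toy labels are illustrative assumptions.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One reweighting round of discrete AdaBoost (labels in {-1, +1}).

    Assumes the weights sum to 1 and the weighted error lies strictly
    between 0 and 1.
    """
    # Weighted error of the current weak learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Vote given to this learner in the final classifier.
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified cases (t * p == -1) grow; correct cases shrink.
    updated = [w * math.exp(-alpha * t * p)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]      # the stump gets only the last case wrong
weights = [0.2] * 5              # start from uniform weights

alpha, weights = adaboost_round(weights, y_true, y_pred)
print(alpha)
print(weights)
```

After one round the single misclassified case carries half of the total weight, so the next stump is strongly pushed to get it right.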


AdaBoost can be used both in classification and regression contexts and achieves impressive results. 
The following example shows an AdaBoost implementation in action on the heart dataset: 


import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets.mldata import fetch_mldata
from sklearn.cross_validation import cross_val_score

n_estimators = 400
# A learning rate of 1. may not be optimal for both SAMME and SAMME.R
learning_rate = 1.

heart = fetch_mldata("heart")
X = heart.data
y = np.copy(heart.target)
y[y == -1] = 0

X_test, y_test = X[189:], y[189:]
X_train, y_train = X[:189], y[:189]

# A one-level decision stump as the weak learner
dt_stump = DecisionTreeClassifier(max_depth=1, min_samples_leaf=1)
dt_stump.fit(X_train, y_train)
dt_stump_err = 1.0 - dt_stump.score(X_test, y_test)

# A deeper tree for comparison
dt = DecisionTreeClassifier(max_depth=9, min_samples_leaf=1)
dt.fit(X_train, y_train)
dt_err = 1.0 - dt.score(X_test, y_test)

ada_discrete = AdaBoostClassifier(
    base_estimator=dt_stump,
    learning_rate=learning_rate,
    n_estimators=n_estimators,
    algorithm="SAMME")
ada_discrete.fit(X_train, y_train)

scores = cross_val_score(ada_discrete, X_test, y_test)
print(scores)
means = scores.mean()
print(means)


In this case, the n_estimators parameter dictates the number of weak learners used; in the case of 
averaging methods, adding estimators will always reduce the bias of your model, but will increase 
the probability that your model has overfit its training data. The base_estimator parameter can be 
used to define different weak learners; the default is decision trees (as training a weak tree is 
straightforward, one can use stumps, very shallow trees). When applied to the heart dataset, as in 
this example, AdaBoost achieved correct labeling in just over 79% of cases, a reasonably solid 
performance for a first pass: 


[ 0.77777778 0.81481481 0.77777778] 


0.79012345679 


Boosting models provide a significant advantage over averaging models; they make it much easier to 
create an ensemble that identifies problem cases or types of problem cases and address them. A 
boosting model will usually target the easiest to predict cases first, with each added model fitting 
against a subset of the remaining incorrectly predicted cases. 


One resulting risk is that a boosting model begins to overfit (in the most extreme case, you can 
imagine ensemble components that have fit to specific cases!) the training data. Managing the correct 
amount of ensemble components is a tricky problem but thankfully we can resort to a familiar 
technique to resolve it. In Chapter 1, Unsupervised Machine Learning, we discussed a visual 
heuristic called the elbow method. In that case, the plot was of K (the number of means), against a 
performance measure for the clustering implementation. In this case, we can employ an analogous 
process using the number of estimators () and the bias or error rate for the ensemble (which we'll 
call e). For a range of different boosting estimators, we can plot their outputs as follows: 
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By identifying a point at which the curve has begun to level off, we can reduce the risk that our model 
has overfit, which becomes increasingly likely as the curve begins to level off. This is true for the 
simple reason that as the curve levels, it necessarily means that the added gains from each new 
estimator are the correct classification of fewer and fewer cases! 


Part of the appeal of a visual aid of this kind is that it enables us to get a feel for how likely our 
solution is to be overfitting. We can (and should!) be applying validation techniques wherever we 
can, but in some cases (for example, when aiming to hit a particular MVP target for a model 
implementation, whether that be informed by use cases or the distribution of scores on the Kaggle 
public leaderboard), we may be tempted to press forward with a performant implementation. 
Understanding exactly how attenuated the gains we're receiving are as we add each new estimator is 
critical to understanding the risk of overfitting. 
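The visual heuristic can also be approximated numerically by looking at how much each increase in the number of estimators actually buys. The sketch below is one simple way to do that; the error values are hypothetical, invented purely for illustration, and the tolerance is an arbitrary assumption that you would choose for your own problem.

```python
def marginal_gains(errors):
    """Reduction in error between each consecutive pair of measurements."""
    return [prev - cur for prev, cur in zip(errors, errors[1:])]


def elbow_index(errors, tolerance):
    """Index of the last measurement before the gain drops below tolerance."""
    for i, gain in enumerate(marginal_gains(errors)):
        if gain < tolerance:
            return i
    return len(errors) - 1


# Hypothetical validation error rates for ensembles of increasing size.
n_estimators = [10, 50, 100, 200, 400, 800]
errors = [0.30, 0.22, 0.19, 0.18, 0.178, 0.177]

i = elbow_index(errors, tolerance=0.02)
print("gains:", marginal_gains(errors))
print("suggested number of estimators:", n_estimators[i])
```

With these made-up figures, going beyond 100 estimators improves the error by less than the tolerance, which is the region where the curve has started to level off.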


Using XGBoost 


In mid-2015, a new algorithm to solve structured machine learning problems, XGBoost, took the 
competitive data science world by storm. Extreme Gradient Boosting (XGBoost) is a well-written, 
performant library that provides a generalized boosting algorithm (Gradient Boosting). 


XGBoost works much like AdaBoost with one key difference: the means by which the model is 
improved. 


At each iteration, XGBoost is seeking to improve the performance of the existing model set by 
reducing the residuals (the differences between targets and label predictions) of that ensemble. Every 
iteration, the model added is selected based on whether it is most able to reduce the existing 
ensemble's residuals. This is analogous to gradient descent (where a function is iteratively minimized 
by moving against a loss gradient); hence, the name Gradient Boosting. 
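The residual-fitting loop described above can be sketched without any libraries. This is a bare-bones illustration of gradient boosting for squared error on one input feature, using regression stumps as the weak learners. It is not XGBoost's algorithm, which adds regularization and second-order gradient information; the function names and toy data are assumptions made for the example.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump on 1-D inputs (needs 2+ distinct xs)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, eta=0.3):
    """Each round fits a stump to the current residuals and adds it in."""
    base = sum(ys) / len(ys)
    stumps, history = [], []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # eta shrinks the contribution of every new stump.
        predictions = [p + eta * stump(x) for x, p in zip(xs, predictions)]
        history.append(sum((y - p) ** 2
                           for y, p in zip(ys, predictions)) / len(ys))

    def predict(x):
        return base + eta * sum(s(x) for s in stumps)

    return predict, history


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]
model, mse_history = gradient_boost(xs, ys)
print("first and last training MSE:", mse_history[0], mse_history[-1])
```

The training error falls with every round because each stump is chosen to explain as much as possible of what the ensemble so far has left unexplained.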


Gradient Boosting has proven to be highly successful in recent Kaggle contests, where it has 
supported the winners of the CrowdFlower Competition and Microsoft Malware Classification 
Challenge, along with many other structured data competitions in the final half of 2015. 


To apply XGBoost, let's grab the XGBoost library. The best way to get this is via pip, with the pip 
install xgboost command on the command line. Note that pip installation is currently 
(late 2015) disabled on Windows. For your benefit, a cold copy of XGBoost is available in the 
Chapter 8 folder of this book's GitHub repository. 


Applying XGBoost is fairly straightforward. In this case, we'll apply the library to a multiclass 
classification task, using the UCI Dermatology dataset. This dataset contains an age variable and a 
large number of categorical variables. An example row of data looks like this: 


3,2,0,2,0,0,0,0,0,0,0,0,1,2,0,2,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,10,2 


A small number of age values (penultimate feature) are missing, encoded by ?. The objective in 
working with this dataset is to correctly classify one of six different skin conditions, per the following 
class distribution: 


Database: Dermatology 


Class code:   Class:                      Number of instances: 
1             psoriasis                   112 
2             seboreic dermatitis         61 
3             lichen planus               72 
4             pityriasis rosea            49 
5             cronic dermatitis           52 
6             pityriasis rubra pilaris    20 


We'll begin applying XGBoost to this problem by loading up the data and dividing it into test and 
train cases via a 70/30 split: 


import numpy as np
import xgboost as xgb

# Column 33 (age) becomes a missing-value flag; column 34 (class) is shifted to 0-5
data = np.loadtxt('./dermatology.data', delimiter=',',
                  converters={33: lambda x: int(x == '?'),
                              34: lambda x: int(x) - 1})
sz = data.shape

train = data[:int(sz[0] * 0.7), :]
test = data[int(sz[0] * 0.7):, :]

train_X = train[:, 0:33]
train_Y = train[:, 34]

test_X = test[:, 0:33]
test_Y = test[:, 34]


At this point, we initialize and parameterize our model. The eta parameter defines the step size 
shrinkage. In gradient descent algorithms, it's very common to use a shrinkage parameter to reduce the 
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size of an update. Gradient descent algorithms have a tendency (especially close to convergence) to 
zigzag back and forth over the optimum; using a shrinkage parameter to downscale the size of a 
change makes the effect of gradient descent more precise. A common (and default) scaling value is 
0.3. In this example, eta has been set to 0.1 for even greater precision (at the possible cost of more 
iterations). 
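The zigzag effect and the role of shrinkage are easy to see on a one-dimensional toy problem. The sketch below minimizes f(x) = x**2 by plain gradient descent. It is purely illustrative: in XGBoost, eta scales the contribution of each new tree rather than a literal descent step, and the function name and step values here are assumptions.

```python
def gradient_descent(start, step, n_steps, shrinkage=1.0):
    """Minimize f(x) = x**2 (gradient 2x), recording every iterate."""
    x = start
    path = [x]
    for _ in range(n_steps):
        # Shrinkage downscales the size of each update.
        x = x - shrinkage * step * 2 * x
        path.append(x)
    return path


# A large step overshoots the optimum at 0 on every update.
zigzag = gradient_descent(1.0, step=0.9, n_steps=5)
# The same step with shrinkage approaches the optimum from one side.
damped = gradient_descent(1.0, step=0.9, n_steps=5, shrinkage=0.3)

print("without shrinkage:", zigzag)
print("with shrinkage:   ", damped)
```

Without shrinkage the iterate changes sign on every step, hopping back and forth across the optimum. With shrinkage it moves steadily toward it.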


The max_depth parameter is intuitive; it defines the maximum depth of any tree in the example. Given 
six output classes, six is a reasonable value to begin with. The num_round parameter defines how 
many rounds of Gradient Boosting the algorithm will perform. Again, you typically require more 
rounds for a multiclass problem with more classes. The nthread parameter, meanwhile, defines how 
many CPU threads the code will run over. 


The DMatrix structure used here is purely for the training speed and memory optimization. It's 
generally a good idea to use these while using XGBoost; they can be built from numpy.arrays. Using 
DMatrix enables the watchlist functionality, which unlocks some advanced features. In particular, 
watchlist allows us to monitor the evaluation results on all the data in the list provided: 


xg_train = xgb.DMatrix(train_X, label=train_Y)
xg_test = xgb.DMatrix(test_X, label=test_Y)

param = {}
param['objective'] = 'multi:softmax'
param['eta'] = 0.1
param['max_depth'] = 6
param['nthread'] = 4
param['num_class'] = 6

watchlist = [(xg_train, 'train'), (xg_test, 'test')]
num_round = 5
bst = xgb.train(param, xg_train, num_round, watchlist)


We train our model, bst, to generate an initial prediction. We then repeat the training process to 
generate a prediction with softmax enabled (via multi:softprob): 


pred = bst.predict(xg_test)

print('predicting, classification error=%f' % (
    sum(int(pred[i]) != test_Y[i] for i in range(len(test_Y)))
    / float(len(test_Y))))

param['objective'] = 'multi:softprob'
bst = xgb.train(param, xg_train, num_round, watchlist)

# softprob returns one probability per class, so reshape to (cases, classes)
yprob = bst.predict(xg_test).reshape(test_Y.shape[0], 6)
ylabel = np.argmax(yprob, axis=1)

print('predicting, classification error=%f' % (
    sum(int(ylabel[i]) != test_Y[i] for i in range(len(test_Y)))
    / float(len(test_Y))))
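The last few lines of the listing turn per-class probabilities into labels and then into an error rate. The same two steps are shown below in plain Python on a handful of hypothetical probability rows, so that the logic can be followed without XGBoost installed. The helper names and the numbers are made up for illustration.

```python
def argmax_rows(prob_rows):
    """Index of the largest probability in each row (the predicted class)."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in prob_rows]


def error_rate(predicted, actual):
    """Fraction of cases where the predicted label differs from the target."""
    wrong = sum(int(p) != int(a) for p, a in zip(predicted, actual))
    return wrong / float(len(actual))


# Hypothetical class probabilities for four cases and three classes.
yprob = [
    [0.70, 0.20, 0.10],
    [0.10, 0.30, 0.60],
    [0.20, 0.50, 0.30],
    [0.40, 0.35, 0.25],
]
actual = [0, 2, 2, 0]

ylabel = argmax_rows(yprob)
print(ylabel, error_rate(ylabel, actual))
```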
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Using stacking ensembles 


The traditional ensembles that we saw earlier in this chapter all shared a common design philosophy: 
they involve multiple classifiers trained to fit a set of target labels and involve the models themselves 
being applied to generate some meta-function through strategies including model voting and boosting. 


There is an alternative design philosophy as regards ensemble creation, known as stacking or, 
alternatively, as blending. Stacking involves multiple layers of models in a configuration where the 
output of one layer of models is used as training data for a model at the next layer. It's possible to 
blend hundreds of different models successfully. 


Stacking ensembles can also make up the blended set of features at a layer's output from multiple sub- 
blends (sometimes called blend-of-blends). To add to the fun, it's also possible to extract 
particularly effective parameters from the models of a stacking ensemble and use them as meta- 
features, within blends or sub-blends at different levels. 


All of this combines to make stacking ensembles a very powerful and extensible technique. The 
winners of the Netflix Prize (and associated $1 million award) used stacking ensembles over 
hundreds of features to great effect. They used several additional tricks to improve the effectiveness 
of their prediction: 


• They trained and optimized their ensemble while holding out some data. They then retrained 
  using the held-out data and again optimized before applying their model to the test dataset. This 
  isn't an uncommon practice, but it yields good results and is worth keeping in mind. 

• They trained using gradient descent and RMSE as the performance function. Crucially, they used 
  the RMSE of the ensemble, rather than that of any of the models, as the relevant performance 
  indicator (the measure of residuals). This should be considered a healthy practice whenever 
  working with ensembles. 

• They used model combinations that are known to improve on the residuals of other models. 
  Neighborhood-based approaches, for instance, improve on the residuals of the RBM, which we 
  examined earlier in this book. By getting to know the relative strengths and weaknesses of your 
  machine learning algorithms, you can find ideal ensemble configurations. 

• They calculated the residuals of their blend using k-fold cross-validation, another technique that 
  we explored and applied earlier in this book. This helped overcome the fact that they'd trained 
  their blend's constituent models using the same dataset as the resulting blend. 
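The k-fold idea in the last point is what makes stacking safe: every metafeature passed to the next layer should come from a model that never saw that row during training. The following sketch builds such out-of-fold predictions in plain Python. The two level-0 "learners" (a mean predictor and a nearest-neighbour lookup) and all function names are hypothetical stand-ins chosen to keep the example short.

```python
def k_fold_indices(n_samples, n_folds):
    """Contiguous (train, test) index lists covering range(n_samples)."""
    sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
             for i in range(n_folds)]
    folds, start = [], 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples)
                 if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds


def out_of_fold_predictions(fit, xs, ys, n_folds):
    """One metafeature column; each value is predicted by a model that was
    trained without the row it is predicting."""
    oof = [None] * len(xs)
    for train, test in k_fold_indices(len(xs), n_folds):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            oof[i] = model(xs[i])
    return oof


def fit_mean(xs, ys):
    """Level-0 learner 1: always predict the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_nearest(xs, ys):
    """Level-0 learner 2: predict the target of the nearest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# One row per case, one column per level-0 model: the level-1 training set.
meta_features = list(zip(out_of_fold_predictions(fit_mean, xs, ys, 3),
                         out_of_fold_predictions(fit_nearest, xs, ys, 3)))
print(meta_features)
```

A level-1 model would then be trained on `meta_features` against `ys`, exactly as any other model is trained on ordinary features.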


The main point to take away from the highly customized nature of the Pragmatic Chaos model used to 
win the Netflix prize is that a first-class model is usually the product of intensive iteration and some 
creative network configuration changes. The other key takeaway is that the basic architectural pattern 
of a stacking ensemble is as follows: 


[Figure: Concept Diagram of Stacking. Level 0: several classifiers, each taking the training data and producing an output value. Level 1: a single classifier that takes those output values as input and produces the final output value.] 


Now that you've learned the fundamentals of how stacking ensembles work, let's try applying them 
to solve data problems. To get us started, we'll use the blend.py code provided in the GitHub 
repository accompanying Chapter 8. Versions of this blending code have been used by highly-scoring 
Kagglers across multiple contests. 


To begin with, we'll examine how stacking ensembles can be applied to attack a real data science 
problem: the Kaggle contest Predicting a Biological Response aimed to build as effective a model as 
possible in order to predict the biological response of molecules given their chemical properties. 
We'll be looking at one particularly successful entry in this competition to understand how stacking 
ensembles can work in practice. 


In this dataset, each row represents a molecule, while each of the 1,776 features describes 
characteristics of the molecule in question. The goal was to predict a binary response from the 
molecule in question, given these properties. 


The code that we'll be applying comes from a competitor in that tournament who used a stacking 
ensemble to combine five classifiers: two differently configured random forest classifiers, two extra 
trees classifiers, and a gradient boosting classifier, which helps to yield slightly differentiated 
predictions from the other four components. 


The duplicated classifiers were provided with different split criteria. One used the Gini Impurity 
(gini), a measure of how often a random record would be incorrectly labeled if it were randomly 
labeled according to the distribution of labels in the potential branch in question. The other tree used 
information gain (entropy), a measure of information content. The information content of a potential 
branch can be measured by the number of bits that would be required to encode it. Using entropy as a 
measure to determine the appropriate split leads branches to become increasingly less diverse, but 
it's important to recognize that the entropy and gini criteria can yield quite different results: 




if __name__ == '__main__':
    np.random.seed(0)
    n_folds = 10
    verbose = True
    shuffle = False
    X, y, X_submission = load_data.load()
    if shuffle:
        idx = np.random.permutation(y.size)
        X = X[idx]
        y = y[idx]
    skf = list(StratifiedKFold(y, n_folds))
    clfs = [RandomForestClassifier(n_estimators=100, n_jobs=-1,
                                   criterion='gini'),
            RandomForestClassifier(n_estimators=100, n_jobs=-1,
                                   criterion='entropy'),
            ExtraTreesClassifier(n_estimators=100, n_jobs=-1,
                                 criterion='gini'),
            ExtraTreesClassifier(n_estimators=100, n_jobs=-1,
    # ...
    print "Blending."
    clf = LogisticRegression()
    # ...
    clf in enumerate(clfs):
    # ...
    y_submission = clf.predict_proba(dataset_blend_test)[:, 1]

    print "Linear stretch of predictions to [0,1]"
    y_submission = (y_submission - y_submission.min()) / \
        (y_submission.max() - y_submission.min())

    print "Saving Results."
    np.savetxt(fname='test.csv', X=y_submission, fmt='%0.9f')
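The two split criteria passed to the forests in this listing, gini and entropy, can be computed by hand for a candidate branch. The sketch below uses the standard definitions of Gini impurity and Shannon entropy (in bits) on class counts. The function names and the example counts are illustrative assumptions, not part of blend.py.

```python
import math


def gini_impurity(counts):
    """Chance of mislabeling a random record when labels are assigned at
    random according to the branch's own class distribution."""
    total = float(sum(counts))
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Average number of bits needed to encode the class of a record."""
    total = float(sum(counts))
    probs = [c / total for c in counts if c > 0]
    return sum(-p * math.log(p, 2) for p in probs)


# Class counts for three hypothetical branches: mixed, mostly pure, pure.
for counts in ([5, 5], [9, 1], [10, 0]):
    print(counts, round(gini_impurity(counts), 3), round(entropy(counts), 3))
```

Both measures are zero for a pure branch and largest for an even mix, but they scale differently in between, which is one reason why trees grown with the two criteria can end up choosing different splits.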


When we try running this submission on the private leaderboard, we find ourselves in a rather 
impressive 12th place (out of 699 competitors)! Naturally, we can't draw too many conclusions from 
a competition that we entered after completion, but, given the simplicity of the code, this is still a 
rather impressive result! 


Applying ensembles in practice 


One particularly important quality to be mindful of while applying ensemble methods is that your goal 
is to tune the performance of the ensemble rather than of the models that comprise it. Your approach 
should therefore be largely focused on building a strong ensemble performance score, rather than the 
strongest set of individual model performances. 


The amount of attention that you pay to the models within your ensemble will vary. With an
arrangement of differently configured or initialized models of a single type (for example, a random
forest), it is sensible to focus almost entirely on the performance of the ensemble and the
metaparameters that shape it.


For more challenging problems, we frequently need to pay closer attention to the individual models
within our ensemble. This is most obviously true when we're trying to create smaller ensembles,
but to build a truly excellent ensemble, it is often necessary to consider the parameters and
algorithms underlying the structure that you've built.


With this said, you'll always be looking at the performance of the ensemble as well as the 
performance of models within the set. You'll be inspecting the results of your models to try and work 
out what each model did well. You'll also be looking for the less obvious factors that affect ensemble 
performance, most notably the correlation of model predictions. It's generally recognized that a more 
effective ensemble will tend to contain performant but uncorrelated components. 


To understand this claim, consider techniques such as correlation measures and PCA that we can use 
to measure the amount of information content present in dataset variables. In the same way, we can 
use Pearson's correlation coefficient against the predictions output by each of our models to 
understand the relationship between performance and correlation for each model. 
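

As a rough illustration of this check, the following sketch computes Pearson's correlation
coefficient between the predictions of three models. The prediction arrays are synthetic stand-ins
(model_a, model_b, and model_c are hypothetical names, not models from this chapter); in practice,
you would substitute each model's predictions on a held-out set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the predicted probabilities of three models
# on the same 200 validation cases (assumed data, for illustration only).
signal = rng.random(200)
model_a = signal + rng.normal(0, 0.05, 200)
model_b = signal + rng.normal(0, 0.05, 200)  # nearly a copy of model_a
model_c = rng.random(200)                    # unrelated to the other two

# np.corrcoef treats each row as one variable, so corr[i, j] is
# Pearson's r between the predictions of model i and model j.
corr = np.corrcoef(np.vstack([model_a, model_b, model_c]))
print(np.round(corr, 2))
```

A pair such as model_a and model_b, with a coefficient close to 1, adds little diversity to an
ensemble. A weakly correlated model is usually the more valuable addition, provided that it is also
performant in its own right.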


Taking us back to stacking ensembles specifically, our ensemble's models are outputting metafeatures 
that are then used as inputs to a next-layer model. Just as we would vet the features used by a more 
conventional neural network, we want to ensure that the features output by our ensemble's components 
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and use of the results in model selection is an excellent place to start in this regard. 


When we deal with single-model problems, we almost always have to spend some time inspecting the
problem and identifying an appropriate learning algorithm. If we're faced with a two-class
classification problem with a moderate number of features (tens) and labeled training cases, we might
select a logistic regression, an SVM, or some other appropriate algorithm for the context. Different
approaches will apply to different problems and through trial and error, parallel testing, and
experience (both personal and posted online!), you will identify the appropriate approach for a
specific objective given specific input data.


A similar logic applies to ensemble creation. Rather than identifying a single appropriate model, the
challenge is to identify combinations of models that effectively describe different elements of an input
dataset in such a way that the dataset as a whole is adequately described. By understanding the
strengths and weaknesses of your component models as well as by exploring and visualizing your
dataset, you'll be able to draw conclusions about how to develop your ensemble effectively through
multiple iterations.


Ultimately, at this level, data science is a field with a great many techniques at hand. The best 
practitioners are able to apply their knowledge of their own algorithms and options to develop very 
effective solutions over many iterations. 


These solutions involve the knowledge of algorithms and interaction of model combinations, model 
parameter adjustments, dataset translations, and ensemble manipulation. Just as importantly, they 
require an uninhibited and creative mindset. 


One good example of this is the work of prominent Kaggle competitor, Alexander Guschin. Focusing
on one specific example—the Otto Product Classification contest—can give us an idea as to the
range of options available to a confident and creative data scientist.


Most model development processes begin with a period in which you throw different solutions at the
problem, attempting to find the tricks underlying the data and figuring out what works. Settling on a
stacking model, Alexander set about building metafeatures. While we looked at XGBoost as an
ensemble in its own right, in this case it was used as a component of the stacking ensemble in order to
generate some of the metafeatures to be used by the final model. Neural networks were used in
addition to the gradient boosted trees as both algorithms tend to produce good results.


To add some contrast to the mixture, Alexander added a KNN implementation, specifically because
the results (and therefore the metafeatures) generated by a KNN tend to differ significantly from
those of the models already included. This approach of picking components whose outputs tend to
differ is crucial in creating an effective stacking ensemble (and most other ensemble types).


To further develop this model, Alexander added some custom elements to the second layer of his 
model. While combining the XGBoost and neural network predictions, he also added bagging at this 
layer. At this point, most of the techniques that we've discussed in this chapter have shown up in some
part of this model. In addition to the model development, some feature engineering (in particular, the
use of TF-IDF on half of the training and test data) and the use of plotting techniques to identify class
differentiation were used throughout.


A truly mature model that can tackle the most significant data science challenges is one that combines
the techniques we've seen throughout this book, created using a solid understanding of the underlying
algorithms and the possibilities for how these techniques can interact with each other.


So far, this book has taught many of the fundamentals—the base of practical knowledge—that a
practitioner has to collect. It has used many examples and an increasing number of real-world cases to
demonstrate how a broad base of knowledge becomes increasingly powerful in letting you develop
effective solutions to difficult problems.


What's required of you as a data scientist is to first apply this broad set of techniques to develop an 
experience of how they can perform and what they could do for you. Then it is up to you to develop 
that creativity and experimental mindset that distinguishes some of the best data scientists.


Using models in dynamic applications


We've spent this chapter discussing the use of techniques to manage model performance under 
conditions that might be seen as ideal; specifically, conditions in which all of the data is available 
ahead of time so that a model can be trained on all data. These assumptions are frequently valid in 
research contexts or when dealing with one-time problems, but in many contexts they are unsafe 
assumptions. The range of unsafe contexts goes beyond the cases where the data is simply
unavailable, such as data science contests where a held-out dataset is used to establish the final 
leaderboard. 


Returning to a subject from earlier in this chapter, you'll recall the Pragmatic Chaos algorithm, which
won the Netflix Prize. By the time Netflix came to assess the algorithm for implementation, both
the business context and requirements had shifted so dramatically that the minimal accuracy gains
provided by that algorithm didn't justify implementation costs. The $1M algorithm was redundant and
was never implemented in production! The point to take from this example is that in commercial
contexts, it is critical for our models to have as much adaptability as we can provide.


The really challenging applications of machine learning algorithms, in which our existing run-once
methodologies become less valuable, are ones where real data changes occur across time (or other
dimensions). In these contexts, one knows that a substantial data change will occur and that existing
models cannot be easily trained to adapt to this data change. At that point, new techniques are needed
as well as new information.


To adapt and gather this information, we need to become better able to predict the ways in which data 
change is liable to occur. With this information, our model building and the content of our ensembles 
can start to change in order to cover the most likely data change scenarios that we see ahead. This 
adaptation lets us pre-empt data change and reduce the adjustment time required. As we'll see later in 
this chapter, in real-world applications any reduction in the time it takes us to pivot based on data 
change is valuable. 


In the next section, we'll be looking at tools that we can use to make our models more robust to 
changing data. We'll discuss the means by which we can maintain a broad set of model options, 
simultaneously accommodating one or multiple data change scenarios, without reducing the 
performance of our models.


Understanding model robustness


It's important to understand exactly what the problem is here and how and when it is presented. This
involves defining two things: the first is robustness as it applies to machine learning algorithms.
The second, of course, is data change. Some of the content in the first part of this section is at an
introductory level, but experienced data scientists may still find value in reviewing the section!


In academic terms, the robustness of a machine learning algorithm is the property that characterizes
how effective your algorithm is while being applied to a dataset other than the dataset on which it
was trained.


Robustness testing is a core part of machine learning methodology in any context. The importance of
validation techniques such as k-fold cross-validation and the use of tests when developing models for
even the simplest contexts is a consequence of machine learning algorithms' vulnerability to data
change.


Most datasets contain both a signal and noise. Noise may be predictable (and thus more easily
managed) or it may be stochastic and difficult to treat. A dataset may contain more or less noise.
Typically, datasets with more noise, or with less predictable noise, are harder to train and test against
than the same datasets with this noise removed (a difference that can be easily tested).


When one has trained a model on a given dataset, it is almost inevitable that this model has learned
based on both the signal and noise. The concept of overfitting is generally used to describe a model
that has fit so well to a given dataset that it has learned to predict based on both the signal and noise,
rendering it less powerful against other samples than a model with a less exact fit.


Part of the goal of training a model is to reduce the impact of any local noise on learning as much as
possible. The purpose of validation techniques that hold out a set of data to test is to ensure that any
learning of noise during training happens only on noise that is local to the training set. The difference
between training and test error can be used to compare the degree of overfitting across model
implementations.
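

A minimal sketch of this comparison is shown below. It assumes a recent scikit-learn release and
uses two decision trees on a synthetic make_moons sample purely as stand-ins; the overfit_gap helper
is a hypothetical name introduced for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data, split evenly into training and held-out sets.
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

def overfit_gap(model):
    """Fit the model and return training accuracy minus test accuracy."""
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    return train_acc - test_acc

# An unrestricted tree can memorize the training set; a shallow one cannot.
deep_gap = overfit_gap(DecisionTreeClassifier(random_state=0))
shallow_gap = overfit_gap(DecisionTreeClassifier(max_depth=3, random_state=0))
print(deep_gap, shallow_gap)
```

A larger gap suggests that more of what the model learned was local to the training set. One would
typically expect the unrestricted tree to show the larger gap here, although the exact values depend
on the sample drawn.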


We've applied cross-validation in Chapter 1, Unsupervised Machine Learning. Another useful
means of testing models for overfitting is to directly add random noise, in the form of jitter, to the
training dataset. This technique was introduced via a Kaggle notebook in October 2015 by Alexander
Minushkin and offers a very interesting test. The concept is simple: by adding jitter and looking at the
accuracy of prediction on the training data, we can distinguish an overfitted model (whose training
error will increase more quickly as we add jitter) from a well- or poorly-fitted model:


[Figure: Expected decrease of accuracy in jitter test. Plotted series: overfitted model,
non-overfitted model, just bad model]





In this case, we're able to plot the results of a jitter test to easily identify whether a model has overfit.
From a very strong initial position, an overfit model will typically rapidly decline in performance as
small amounts of jitter are added. For better-fitting models, the loss in performance with added jitter
is much reduced, with the degree of overfitting in a model being particularly obvious at low levels of
added jitter (where a well-fit model will tend to outperform an overfit counterpart).


Let's look at how we implement a jitter test for overfitting. We use a familiar score,
accuracy_score, defined as the proportion of class labels predicted correctly, as the basis for test
scoring. Jitter is defined by simply adding random noise to the data (using np.random.normal) with
the amount of noise defined by the configurable scale parameter:


import numpy as np
from sklearn.metrics import accuracy_score


def jitter(X, scale):
    # Add zero-mean Gaussian noise with standard deviation `scale`
    if scale > 0:
        return X + np.random.normal(0, scale, X.shape)
    return X


def jitter_test(classifier, X, y, metric_FUNC=accuracy_score,
                sigmas=np.linspace(0, 0.5, 20), averaging_N=5):
    out = []

    for s in sigmas:
        averageAccuracy = 0.0
        for _ in range(averaging_N):
            averageAccuracy += metric_FUNC(y, classifier.predict(jitter(X, s)))

        out.append(averageAccuracy/averaging_N)
    # np.trapz integrates the accuracy curve (np.trapezoid in NumPy 2.0 and later)
    return (out, sigmas, np.trapz(out, sigmas))


allJT = {}


The jitter test itself is defined as a wrapper around normal sklearn classification, given a classifier,
training data, and a set of target labels. The classifier is then called to predict against a version of the
data that first has the jitter operation applied to it.


At this point, we'll begin creating a number of datasets to run our jitter test over. We'll use sklearn's
make_moons dataset, commonly used as a dataset to visualize clustering and classification algorithm
performance. This dataset is composed of two classes whose data points create interleaving
half-circles. By adding varying amounts of noise to make_moons and using differing numbers of samples,
we can create a range of example cases to run our jitter test against:


import sklearn 
import sklearn.datasets 


import warnings 
warnings.filterwarnings("ignore", category=DeprecationWarning) 


Xs = []
ys = []


# low noise, plenty of samples, should be easy
X0, y0 = sklearn.datasets.make_moons(n_samples=1000, noise=.05)
Xs.append(X0)
ys.append(y0)


# more noise, plenty of samples
X1, y1 = sklearn.datasets.make_moons(n_samples=1000, noise=.3)
Xs.append(X1)
ys.append(y1)


# less noise, few samples
X2, y2 = sklearn.datasets.make_moons(n_samples=200, noise=.05)
Xs.append(X2)
ys.append(y2)


# more noise, fewer samples, should be hard
X3, y3 = sklearn.datasets.make_moons(n_samples=200, noise=.3)
Xs.append(X3)
ys.append(y3)


This done, we then create a plotter function that we'll use to show our models' performance directly
against the input data:


def plotter(model, X, Y, ax, npts=5000):
    # Sample random points across the data's bounding box and color each
    # one by the class that the model predicts for it
    xs = []
    ys = []
    cs = []
    x0spr = max(X[:,0])-min(X[:,0])
    x1spr = max(X[:,1])-min(X[:,1])
    for _ in range(npts):
        x = np.random.rand()*x0spr + min(X[:,0])
        y = np.random.rand()*x1spr + min(X[:,1])
        xs.append(x)
        ys.append(y)
        # predict expects a 2D array of samples, so wrap the single point
        cs.append(model.predict([[x, y]])[0])
    ax.scatter(xs,ys,c=list(map(lambda x:'lightgrey' if x==0 else 'black', cs)),
               alpha=.35)
    # Overlay the training points; repeated scatter calls share the same axes
    ax.scatter(X[:,0],X[:,1],
               c=list(map(lambda x:'r' if x else 'lime',Y)),
               linewidth=0,s=25,alpha=1)
    ax.set_xlim([min(X[:,0]), max(X[:,0])])
    ax.set_ylim([min(X[:,1]), max(X[:,1])])
    return


We'll use an SVM classifier as the base model for our jitter tests: 


import sklearn.svm 
classifier = sklearn.svm.SVC() 


allJT[str(classifier)] = list() 


fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(11,13))
i = 0
for X, y in zip(Xs, ys):
    classifier.fit(X, y)
    plotter(classifier, X, y, ax=axes[i//2, i%2])
    allJT[str(classifier)].append(jitter_test(classifier, X, y))
    i += 1
plt.show()


The jitter test provides an effective means of assessing model overfitting and performs comparably to 
cross-validation; indeed, Minushkin provides evidence that it can outperform cross-validation as a 
tool to measure model fit quality. 


Both of these tools for mitigating overfitting work well in contexts where your algorithm is either run
over data on a one-off basis or where underlying trends don't vary substantially. This is true for the
majority of single-dataset problems (such as most academic or web repository datasets) or data
problems where the underlying trends change slowly.


However, there are many contexts where the data involved in modeling might change over time in one
or several dimensions. This can occur because of change in the methods by which data is captured,
usually because new instruments or techniques come into use. For instance, video data captured by 
commonly-available devices has improved substantially in resolution over the decade since 2005 and 
the quality (and size!) of such data has increased. Whether you're using the video frames themselves 
or instead the file size as a parameter, you'll observe noticeable shifts in the nature, quality, and 
distributions of features. 


Alternatively, changes in dataset variables might be caused by differences in underlying trends. The 
classic data schema concept of measures and dimensions comes back into play here, as we can better 
understand how data change is affected by considering what dimensions influence our measurement.


The key example is time. Depending on context, many variables are subject to day-of-week,
month-of-year, or seasonal variations. In many cases, a helpful option might be to parameterize these
variables (as we discussed in the preceding chapter, techniques such as one-hot encoding can help
our algorithms learn to parse such trends), particularly if we're dealing with periodic trends that are
easily predicted (for example, the impact of month-of-year on scarf sales in a given location) and
easily modeled.
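

As a small sketch of this kind of parameterization, the following snippet one-hot encodes a
month-of-year variable with NumPy. The months array is a made-up sample, and the encoding shown is
only one of several reasonable ways to expose a periodic dimension to a model.

```python
import numpy as np

# Hypothetical month-of-year values (1-12) for five observations.
months = np.array([1, 2, 12, 7, 12])

# One column per month; each row has a single 1 in the column for its month.
one_hot = np.zeros((months.size, 12), dtype=int)
one_hot[np.arange(months.size), months - 1] = 1
print(one_hot)
```

Each row now carries twelve binary features, so a model can learn a separate effect for each month
rather than treating the month number as an ordered quantity.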


A more problematic type of time series trend is non-periodic change. As in the preceding video
camera example, some types of time series trends change irrevocably and in ways that might not be
trivial to predict. Telemetry from software tends to be influenced by the quality and functionality of
the software build live at the time the telemetry was emitted. As builds change over time, the values
sent in telemetry and the variables created from those values can change radically overnight in
hard-to-predict ways.


Human behavior, a hugely important factor in many datasets, helpfully changes both periodically and 
non-periodically. People shop more around seasonal holidays, but also change their shopping habits 
permanently based on new societal or technological developments. 


Some of the added complexity here comes not just from the fact that single variables and their 
distributions are affected by time series trends, but also from how relationships between relevant 
factors and their associated variables will change. The relationships between variables may change 
in quantifiable terms. One example is how, for humans, height and weight are two variables whose 
relationship varies between times and locations. The BMI feature, which we might use to track this 
relationship, shows differing distributions when sampled across periods of time or between 
locations. 


Furthermore, variables can change in another serious way; namely, their importance to a performant
modeling algorithm may vary over time! Some variables whose values are highly relevant in some
periods of time will be less relevant in others. As an example, consider how climate and weather
variables affect agriculture markets. For some crops and the companies dealing in them, these
variables are fairly unimportant for much of the year. At the time of crop growth and harvest,
however, they become fundamentally important. To make this more complex, the strength of these
factors' importance is also tied to location (and local climate).


The challenge for modeling is clear. For models that are trained once and run again on new data, 
managing data change can present serious challenges. For models that are dynamically recomputed 
based on new input data, data change can still create problems as variable distributions and 
relationships change and available variables become more or less valuable in generating an effective 
solution. 


Part of the key to successfully managing data change in your application of ML is to recognize the
dimensions (and there are common culprits) where change is probable and liable to affect the
distributions of your features, relationships, and feature importance, which a model will attempt to
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Once you have an understanding of which factors in your data are likely to influence
overfitting, you're better positioned to develop a solution that manages these factors effectively.


This said, it will still seem hugely challenging to build a single model that can resolve any potential
issues. The simple response to this is that if one faces serious data change issues, the solution
probably isn't to try to solve for them with a single model! In the next section, we'll be looking at
ensemble methods to provide a better answer.


Identifying modeling risk factors 


While it is in many cases quite straightforward to identify which elements present a risk to your 
model over time, it can help to employ a structured process for identification. This section briefly 
describes some of the heuristics and techniques you can employ to screen your models for the risk of 
data change. 


Most data scientists keep a data dictionary for datasets that are intended for general use or automated
applications. This is especially likely to happen if the data or applications are complex, but keeping a
data dictionary is generally good practice. Some of the most effective work you can do in identifying
risk factors is to run through these features and tag them based on different risk types.


Some of the tags that I tend to use include the following: 


• Longitudinally variant: Is this parameter liable to change over a long time due to longitudinal
trends that may not be fully visible in the span of the training data that you have available? The
most obvious example is the ecological seasons, which affect many areas of human behavior as
well as the many things that depend on some more fundamental climatic variables. Other
longitudinal trends include the financial year and the working month, but extend to include many
other longitudinal trends relevant to your area of investigation. The life cycle of new iPhone
models or the population flux of voles might be an important longitudinal factor depending on the
nature of your work.

• Slowly changing: Is this categorical parameter likely to gain new values over time? This
concept is borrowed from data warehousing best practices. A slowly changing dimension in the
classical sense will gain new parameter codes (for example, as a new store opens or a new case
is identified). These can throw your model entirely if not managed properly or if they appear in
sufficient number. Another impact of slowly changing data, which can be more problematic to
handle, is that it can begin to affect the distribution of your features. This can have a substantial
impact on the effectiveness of your model.

• Key parameter: A combination of data value monitoring and recalculation of decision
boundaries/regression equations will often handle a certain amount of slowly changing data and
seasonal variance well, but consider taking action should you see an unexpectedly large number
of new cases or case types, especially when they affect variables that your model depends on
heavily. For this reason, also make sure that you know which variables are most relied upon by
your solution!
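

To make the slowly changing tag concrete, here is a minimal sketch of a monitoring check that
reports category codes that were absent from the training data. The new_category_rate function and
the store codes are hypothetical and stand in for whatever categorical feature you have tagged.

```python
def new_category_rate(train_values, live_values):
    """Return the unseen codes in live data and the share of live rows carrying them."""
    known = set(train_values)
    unseen_rows = [v for v in live_values if v not in known]
    rate = len(unseen_rows) / len(live_values) if live_values else 0.0
    return sorted(set(unseen_rows)), rate

# Hypothetical store codes seen at training time and in newly arriving data.
train_stores = ["S01", "S02", "S03", "S01", "S02"]
live_stores = ["S01", "S04", "S02", "S04", "S05", "S03", "S01", "S02"]

unseen, rate = new_category_rate(train_stores, live_stores)
print(unseen, rate)
```

A rising rate for a feature that the model relies on heavily is the kind of signal that the key
parameter tag is meant to surface.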


The process of tagging in this way is helpful (not least as an export of your own memory) mostly
because it helps you to do the following:


• Organize your expectations and develop a kind of checklist for your development of monitoring
readiness. If you aren't able to keep track of change in at least your longitudinally variant and
slowly changing parameters, you are effectively blind to any output from your model besides
changes in the parameters that it favors when recomputed and its (likely slowly declining)
performance measure.

• Investigate mitigation (for example, improved normalization or extra parameters that codify
those dimensions in which your data is variant). In many ways, mitigation and the addition of
parameters is the best solution you can tap to handle data change.

• Set up robustness testing using constructed datasets, where your risk features are deliberately
varied to simulate data change. Stress-test your model under these conditions and find out
exactly how much variance it'll tolerate. With this information, you can easily set yourself up to
use your monitoring values as an early alert system; once data change exceeds a certain safe
threshold, you know how much degradation to expect in the model performance.
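

The stress test described in the last point can be sketched as follows. This is an illustration
under stated assumptions rather than a prescribed procedure: it shifts a single feature of a
synthetic make_moons sample by increasing amounts, and accuracy_under_shift and the 0.05 tolerance
are hypothetical choices.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.1, random_state=0)
model = SVC().fit(X, y)

def accuracy_under_shift(model, X, y, feature, shifts):
    """Score the model on copies of X in which one feature is offset by each shift."""
    scores = []
    for shift in shifts:
        X_shifted = X.copy()
        X_shifted[:, feature] += shift   # simulate drift in the risk feature
        scores.append(accuracy_score(y, model.predict(X_shifted)))
    return np.array(scores)

shifts = np.linspace(0.0, 1.0, 11)
scores = accuracy_under_shift(model, X, y, feature=0, shifts=shifts)

# Smallest simulated shift at which accuracy falls more than `tolerance`
# below the unshifted baseline (None if that never happens).
tolerance = 0.05
breached = shifts[scores < scores[0] - tolerance]
safe_threshold = breached[0] if breached.size else None
print(scores, safe_threshold)
```

The resulting threshold can then be compared against monitored values of the same feature, so that
an alert fires before the expected degradation is reached.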




Strategies for managing model robustness


We've discussed a number of effective ensemble techniques that allow us to balance the twin needs 
for performant and robust models. However, throughout our exposition and use of these techniques, 
we had to decide how and when we would reduce our model's performance to improve robustness. 


Indeed, a common theme in this chapter has been how to balance the conflicting objectives of creating
an effective, performant model without making this model too inflexible to respond to data change.
Many of the solutions that we've seen so far have required that we trade off one outcome against the
other, which is less than ideal.


At this point, it's worth our taking a slightly wider view of our options and drawing from
complementary techniques. The need for robust, performant statistical models within evolving
business landscapes is neither new nor untreated; fields such as credit risk modeling have a long
history of applied statistical modeling in changing domains and have developed effective decision
management methodologies in order to succeed. As data scientists, we can turn some of these
established techniques to our own benefit by using them to help organize our own models.


One effective methodology is Champion/Challenger, a test-centric approach that involves running
multiple, parallel model configurations. In addition to the model whose outputs are applied (to direct
business activities or inform reporting), champion/challenger approaches involve training one or more
alternative model configurations.


By maintaining and monitoring multiple models, one can arrange to substitute the current model as and 
when an alternative outperforms it. This is usually done by maintaining a performance scoring 
process for all models and observing the results so that a manual decision call can be made about 
whether and when to switch to a challenger. 
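

The scoring process described above might be supported by a small helper such as the one sketched
here. It only produces a recommendation for a human reviewer; the recommend_champion name, the
min_lift and window parameters, and the score histories are all assumptions made for illustration.

```python
import numpy as np

def recommend_champion(history, champion, min_lift=0.01, window=3):
    """Return the model name that a reviewer might consider promoting.

    history maps model name -> list of per-period scores (higher is better);
    all lists are assumed to have the same length, at least `window`.
    """
    champ_recent = np.array(history[champion][-window:])
    best_name, best_mean = champion, champ_recent.mean()
    for name, scores in history.items():
        if name == champion:
            continue
        recent = np.array(scores[-window:])
        # Require a sustained lead over the whole window, so that a single
        # good period is not enough to trigger a recommendation.
        sustained = np.all(recent - champ_recent >= min_lift)
        if sustained and recent.mean() > best_mean:
            best_name, best_mean = name, recent.mean()
    return best_name

history = {
    "champion":     [0.80, 0.79, 0.78, 0.77],
    "challenger_a": [0.78, 0.81, 0.81, 0.81],
    "challenger_b": [0.70, 0.79, 0.76, 0.85],
}
print(recommend_champion(history, "champion"))
```

In this made-up history, challenger_b has the best single period but does not lead consistently,
so only challenger_a is put forward for review.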


While the simplest implementation may involve switching to a challenger as soon as it outperforms
the main model, this is rarely done as there are risks around specific challenger models being
exposed to local minima (for example, the day-of-week or month-of-year local trends). It is normal to
spend a significant period assessing a challenger model, particularly ahead of sensitive applications.
In complex real cases, one may even want to do additional testing by providing a sample of treatment
cases to a promising challenger to determine whether it generates significant lift over the champion.


There is scope for some creativity beyond simple "replace the champion" succession rules.
Voting-based approaches are quite common, where a top subset of the trained models provides scores on
a case-by-case basis and those scores are treated as (weighted or unweighted) votes. Another approach
involves using a Borda count, a voting system where each voter ranks the candidate solutions in
order of preference. In the context of ensembling, one would typically assign each individual model's
prediction a point value equal to its inverse rank (keeping each model separate!). Then one can
combine these votes (usually experimenting with a range of different weightings) to generate a result.
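

One way to sketch a Borda count over classifier outputs is shown below. It assumes that each model
supplies class probabilities for each case and that there are no tied probabilities within a
prediction; borda_vote and the sample probabilities are hypothetical.

```python
import numpy as np

def borda_vote(probas, weights=None):
    """probas has shape (n_models, n_cases, n_classes); returns the winning class per case."""
    probas = np.asarray(probas)
    n_models = probas.shape[0]
    weights = np.ones(n_models) if weights is None else np.asarray(weights, dtype=float)
    # argsort of argsort gives each class's rank within one model's prediction:
    # 0 for the least preferred class, n_classes - 1 for the most preferred.
    points = probas.argsort(axis=2).argsort(axis=2)
    # Weighted sum of points over models, giving shape (n_cases, n_classes).
    totals = np.tensordot(weights, points, axes=1)
    return totals.argmax(axis=1)

# Three models, two cases, three classes (made-up probabilities).
probas = [
    [[0.50, 0.40, 0.10], [0.20, 0.30, 0.50]],
    [[0.10, 0.50, 0.40], [0.10, 0.20, 0.70]],
    [[0.30, 0.45, 0.25], [0.60, 0.30, 0.10]],
]
print(borda_vote(probas))                     # unweighted votes
print(borda_vote(probas, weights=[1, 1, 3]))  # third model counts triple
```

Changing the weights changes the outcome for the second case, which is why it is usual to
experiment with several weightings before settling on one.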


Voting can perform fairly well with a larger number of models but is dependent on the specific
modeling context and factors like the similarity of the different voters. As we discussed earlier in this
chapter, it's critical to use tests such as Pearson's correlation coefficient to ensure that your model set
is both performant and uncorrelated.


One may find that particular classes of input data (users, say, with specific segmentation tags) are 
more effectively treated by a given challenger and may implement a case routing system where 
multiple champions deal with different user subgroups. This approach overlaps somewhat with the 
benefits of boosting ensembles, but can help in production circumstances by separating concerns. 
However, maintaining multiple champions will increase the monitoring and oversight burden for your 
data team, so this option is best avoided if not entirely necessary.


A major concern to address is how we go about scoring our models, not least because there are 
immediate practical challenges. In particular, it is hard to compare multiple models in real contexts,
given that class labels (to guide correctness) typically aren't available. In predictive contexts, this 
problem is compounded by the fact that the champion model's predictions are typically used to take 
actions that alter predicted events. This activity makes it very difficult to make assertions about how a 
challenger model's predictions would've performed; by taking action based on our champion's 
predictions, we're unable to confirm the results of our models! 


The most common implementation process is to provide each challenger model with a statistically 
viable sample of the input data and then compare the lift from each approach. This approach 
inherently limits the number of challengers that one can support for some modeling problems. Another 
option is to leave just one statistically viable sample out of any treatment activity and use it to create 
a single regression test. This test is applied to the entire set of champion and challenger models, 
providing a meaningful basis for comparison. 


The downside to this approach is that the change to a more effective model will always trail the data 
change by however long it takes to generate correct class labels for the test cases. While in many 
cases this isn't crippling (the champion model remains in place for the period it takes to generate 
accurate models), it can present problems in contexts where underlying conditions change rapidly 
compared to the training time for models. 


Note 


It's worth making one brief comment on the relationship between model training time and data change 
frequency. It isn't always clearly stated as such, but the typical goal in applied machine learning 
contexts is to reduce the factor of training time to data change frequency to the smallest value 
possible. To take the worst case, 1f the length of time it takes to train a model is longer than the length 
of time that model will be accurate for (and the ratio is equal to or greater than one), your model will 
never generate current results that can directly drive current actions. In general, a high ratio should 
prompt review and adjustment activities (either an investigation into whether faster score delivery at 
lower confidence delivers more value or adjustment to the rate at which controllable environment 
variables change). 
The smaller this ratio becomes, the more leeway your team has to apply your model's outputs to drive 
actions and generate value. Depending on how variant and quantifiable this ratio is for your modeling 
context, it can be a useful concept to promote within your organization as a health measure for your 
automated modeling solution. 
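As a rough illustration of how this ratio might be tracked as a health measure, consider the following sketch. The function names, the example durations, and the review threshold of 0.5 are all assumptions made for the example, not values taken from the text.

```python
def training_to_change_ratio(training_hours, validity_hours):
    """Ratio of model training time to the period the model stays accurate."""
    if validity_hours <= 0:
        raise ValueError("validity_hours must be positive")
    return training_hours / validity_hours


def needs_review(ratio, threshold=0.5):
    """Flag a modeling context whose ratio is at or above a chosen threshold."""
    return ratio >= threshold


# Hypothetical figures: 6 hours to retrain, data stays representative for 48 hours.
ratio = training_to_change_ratio(training_hours=6, validity_hours=48)
print(ratio, needs_review(ratio))

# Worst case from the note: training takes longer than the model stays accurate.
stale = training_to_change_ratio(training_hours=30, validity_hours=24)
print(stale, needs_review(stale))
```

A ratio at or above one corresponds to the worst case in the note: the model is already out of date by the time training finishes.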


These alternative models may simply be the next best-performing ensemble configurations; they may 
be older models, kept around for observation. In sophisticated operations, some challengers are 
configured to handle different what-if scenarios (for example, what if the temperature in this region 
is 2°C below expectations or what if sales are significantly below expectations). These models may 
have been trained on the same data as the main model or on deliberately skewed or prepared data that 
simulates the what-if scenario. 
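One simple way to prepare deliberately skewed data for a what-if challenger is to shift a single feature before training. The sketch below assumes rows stored as dictionaries; the field names and the helper `apply_scenario` are hypothetical.

```python
def apply_scenario(rows, feature, shift):
    """Return copies of the rows with one feature shifted by a fixed amount."""
    return [{**row, feature: row[feature] + shift} for row in rows]


# Hypothetical training rows; the field names are illustrative only.
baseline = [
    {"temperature_c": 18.0, "units_sold": 120},
    {"temperature_c": 21.5, "units_sold": 150},
]

# "What if the temperature in this region is 2 degrees below expectations?"
cold_scenario = apply_scenario(baseline, "temperature_c", -2.0)
print(cold_scenario)
```

The baseline rows are left untouched, so the main model and the what-if challenger can be trained side by side from the same source data.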


More challengers tend to be better (providing improved robustness and performance), provided that 
the challengers are not all minute variations on the same theme. Challenger models also provide a 
safe venue for innovation and testing, while observing effective challengers can provide useful 
insights into how robust your champion ensemble is likely to be to a range of possible environmental 
changes. 


The techniques that you've learned to apply in this section have provided us with the tools to apply 
our existing toolkit of models to real applications in evolving environments. This chapter also 
discussed complications that can arise when applying ML models to production; data change, 
between samples or across dimensions, will cause our models to become increasingly ineffective. By 
thoroughly unpacking the concept of data change, we became better able to characterize this risk and 
recognize where and how it might present itself. 


The remainder of the chapter was dedicated to techniques that provide improved model robustness. 
We discussed how to identify model degradation risk by looking at the underlying data and discussed 
some helpful heuristics to this end. We drew from existing decision management methods to learn 
about and use Champion/Challenger, a well-regarded process with a long history in contexts 
including applied machine learning. Champion/Challenger helps us organize and test multiple models 
in healthy competition. In conjunction with effective performance monitoring, a proactive tactical 
plan for model substitution will give you faster and more controllable management of the model life 
cycle and quality, all the while providing a wealth of valuable operational insights. 
Further reading 


Perhaps the most wide-ranging and informative tour of ensembles and ensemble types is provided by 
the Kaggle competitor, Triskelion, at http://mlwave.com/kaggle-ensembling-guide/. 


For discussion of the Netflix Prize-winning model, Pragmatic Chaos, refer to 
http://www.stat.osu.edu/~dmsl/GrandPrize2009_BPC_BellKor.pdf. For an explanation by Netflix of 
how business contexts rendered that $1M model redundant, refer to the Netflix Tech blog. 


For a walkthrough on applying random forest ensembles to commercial contexts, with plenty of space 
given to all-important diagnostic charts and reasoning, consider Arshavir Blackwell's blog. 


For further information on random forests specifically, I find the scikit-learn documentation helpful: 
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html. 


A great introduction to gradient-boosted trees is provided within the XGBoost documentation at 
http://xgboost.readthedocs.io/en/latest/model.html. 


For a write-up of Alexander Guschin's entry to the Otto Product Classification challenge, refer to the 
No Free Hunch blog: 
http://blog.kaggle.com/2015/06/09/otto-product-classification-winners-interview-2nd-place-alexander-guschin/. 


Alexander Minushkin's Jitter test for overfitting is described at 
https://www.kaggle.com/miniushkin/introducing-kaggle-scripts/jitter-test-for-overfitting-notebook. 
Summary 


In this chapter, we covered a lot of ground. We began by introducing ensembles, some of the most 
powerful and popular techniques in competitive machine learning contexts. We covered both the 
theory and code needed to apply ensembles to our machine learning projects, using a combination of 
expert knowledge and practical examples. 


In addition, this chapter dedicated a section to discussing the unique considerations that arise 
when you run models for weeks and months at a time. We discussed what data change can mean, how 
to identify it, and how to think about guarding against it. We gave specific consideration to the 
question of how to create sets of models running in parallel, which you can switch between based on 
seasonal change or performance drift in your model set. 


During our review of these techniques, we spent significant time with real-world examples with the 
specific aim of learning more about the creative mindset and broad range of knowledge required of 
the best data scientists. 


The techniques throughout this book have led up to a point where, armed with technical knowledge, 
code to reapply, and an understanding of the possibilities, you are truly able to take on any data 
modeling challenge. 
Chapter 9. Additional Python Machine Learning Tools 


Over the course of the eight preceding chapters, we have examined and applied a range of techniques 
that help us enrich and model data for many applications. 


We approached the content in these chapters using a combination of Python libraries, particularly 
NumPy and Theano, while the other libraries were drawn upon as and when we needed to access 
specific algorithms. We did not spend a great deal of time discussing what other options existed in 
terms of tools, what the unique differentiators of these tools were, or why we might be interested. 


The primary goal of this final chapter is to highlight some other key libraries and frameworks that are 
available to you to use. These tools streamline and simplify the process of creating and applying 
models. This chapter presents these tools, demonstrates their application, and provides extensive 
advice regarding Further reading. 


A major contributor to success in solving data science challenges, and to being successful as a data 
scientist, is having a good understanding of the latest developments in algorithms and libraries. As 
professionals, data scientists tend to be highly dependent on the quality of the data they use, but it is 
also very important to have the best tools available. 


In this chapter, we will review some of the best recent tools available to data scientists, 
identifying the benefits they offer, and discussing how to apply them alongside tools and techniques 
discussed earlier in this book within a consistent working process. 




Alternative development tools 


Over the last couple of years, a number of new machine learning frameworks have emerged that offer 
advantages in terms of workflow. Usually these frameworks are highly focused on a specific use case 
or objective. This makes them very useful, perhaps even must-have tools, but it also means that you 
may need to use multiple workflow improvement libraries. 


With an ever-growing set of new Python ML projects being lit up to address specific workflow 
challenges, it's worth discussing two libraries that add to our existing workflow and which accelerate 
or improve the work we've done in the preceding chapters. In this chapter, we'll be introducing 
Lasagne and TensorFlow, discussing the code and capabilities of each library and identifying why 
each framework is worth considering as a part of your toolset. 
Introduction to Lasagne 


Let's face it; sometimes creating models in Python takes longer than we'd like. While libraries such 
as Theano can be efficient for more complex models and offer big benefits (such as GPU acceleration 
and configurability), they can be relatively complex to use when working on simple 
cases. This is unfortunate because we often want to work with simple models, for instance, when 
we're setting up benchmarks. 


Lasagne is a library developed by a team of deep learning and music data mining researchers to work 
as an interface to Theano. It is designed specifically to nail a particular goal—to allow for fast and 
efficient prototyping of new models. 


This focus dictated how Lasagne was created: to call Theano functions and return Theano expressions 
or NumPy data types, in a much less complex and more easily understood manner than the same 
operations written in native Theano code. 


In this section, we'll take a look at the conceptual model underlying Lasagne, apply some Lasagne 
code, and understand what the library adds to our existing practices. 


Getting to know Lasagne 


Lasagne operates using the concept of layers, a familiar concept in machine learning. A layer is a set 
of neurons and operating rules that will take an input and generate a score, label, or other 
transformations. Neural networks generally function as a set of layers that feed input data in at one 
end and push output values out at the other (though the ways in which this gets done vary broadly). 


It has become very popular in deep learning contexts to start treating individual layers as first-class 
citizens. Traditionally, in machine learning work, a network would be established from layers using 
only a few parameter specifications (such as node count, bias, and weight values). 


In recent years, data scientists seeking that extra edge have begun to take increasing interest in the 
configuration of individual layers. Nowadays it is not unusual in advanced machine learning 
environments to see layers that contain sub-models and transformed inputs. Even features, nowadays, 
might skip layers as needed and new features may be added to layers partway through a model. As an 
example of some of this refinement, consider the convolutional neural network architectures 
employed by Google to solve image recognition challenges. These networks are extensively refined 
at a layer level to generate performance improvements. 


It therefore makes sense that Lasagne treats layers as its basic model component. What Lasagne adds 
to the model creation process is the ability to stack different layers into a model quickly and 
intuitively. One may simply call a class within lasagne.layers to stack a layer onto your model. 
The code for this is concise and looks as follows: 


import lasagne 

# X is the training data array and N_CLASSES the number of target classes 
l0 = lasagne.layers.InputLayer(shape=X.shape) 
l1 = lasagne.layers.DenseLayer( 
    l0, num_units=10, nonlinearity=lasagne.nonlinearities.tanh) 
l2 = lasagne.layers.DenseLayer( 
    l1, num_units=N_CLASSES, 
    nonlinearity=lasagne.nonlinearities.softmax) 


In three simple statements, we have created the basic structure of a network using simple and 
configurable functions. 


This code creates a model using three layers. The layer l0 calls the InputLayer class, acting as an 
input layer for our model. This layer translates our input dataset into a Theano tensor, based on the 
expected shape of the input (defined using the shape parameter). 


The next layers, l1 and l2, are each fully connected (dense) layers. Layer l2 is defined as an output 
layer, with a number of units equal to the number of classes, while l1 uses the same DenseLayer 
class to create a hidden layer of 10 units. 
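To make the roles of the three layers concrete, the forward pass they describe can be written out by hand. The following is a plain-Python sketch and not Lasagne code: the helper names, the random initialisation, and the four-feature sample are assumptions made for illustration, and a real implementation would use array operations instead of lists.

```python
import math
import random


def dense(inputs, unit_weights, biases, nonlinearity):
    """Fully connected layer: one weighted sum per unit, then the nonlinearity."""
    pre_activations = [
        sum(x * w for x, w in zip(inputs, weights)) + bias
        for weights, bias in zip(unit_weights, biases)
    ]
    return nonlinearity(pre_activations)


def tanh(values):
    return [math.tanh(v) for v in values]


def softmax(values):
    peak = max(values)                       # subtract the max for stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """One weight vector per output unit, plus zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
N_FEATURES, N_HIDDEN, N_CLASSES = 4, 10, 3

w1, b1 = init_layer(N_FEATURES, N_HIDDEN, rng)   # plays the role of the hidden layer
w2, b2 = init_layer(N_HIDDEN, N_CLASSES, rng)    # plays the role of the output layer

sample = [5.1, 3.5, 1.4, 0.2]                    # the input layer only fixes this shape
hidden = dense(sample, w1, b1, tanh)
class_probabilities = dense(hidden, w2, b2, softmax)
print(class_probabilities)
```

The softmax output sums to one, so each value can be read as the network's probability for one class.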


In addition to configuration of the standard parameters (weights, biases, unit count and nonlinearity 
type) available to the DenseLayer class, it is possible to employ entirely different network types 
using different classes. Lasagne provides classes for a broad set of familiar layers, including dense, 
convolutional and pooling layers, recurrent layers, normalisation and noise layers, amongst others. 
There is, furthermore, a special-purpose layer class, which provides a range of additional 
functionality. 


If something more bespoke than what these classes provide is needed, of course, the user can resort to 
defining their own layer type easily and use it in conjunction with other Lasagne classes. However, 
for a majority of prototyping and fast, iterative development contexts, this is a great amount of 
pre-prepared capability. 


Lasagne provides a similarly succinct interface to define the loss calculation for a network: 


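import theano.tensor as T 

# Integer vector holding the true class index of each sample 
true_output = T.ivector('true_output') 
# Objective pairs the output layer with a loss function 
objective = lasagne.objectives.Objective( 
    l2, loss_function=lasagne.objectives.categorical_crossentropy) 
loss = objective.get_loss(target=true_output) 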


The loss function defined here is one of the many available functions, including squared error, hinge 
loss for binary and multi-class cases, and cross-entropy functions. An accuracy scoring function for 
validation is also provided. 
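For intuition about what two of these loss functions compute, here is a hand-written sketch in plain Python. It does not use the Lasagne API; the function names mirror the concepts only, and the example probabilities are made up for illustration.

```python
import math


def categorical_crossentropy(probabilities, true_class):
    """Loss for one sample: negative log of the probability of the true class."""
    return -math.log(probabilities[true_class])


def squared_error(predictions, targets):
    """Squared difference between outputs and targets, summed over the units."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets))


def mean(values):
    return sum(values) / len(values)


# Hypothetical softmax outputs for two samples over three classes.
batch_probabilities = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
batch_true_classes = [0, 1]

loss = mean([
    categorical_crossentropy(p, t)
    for p, t in zip(batch_probabilities, batch_true_classes)
])
print(loss)
```

The second sample contributes the larger share of the loss, because the network assigned only 0.3 to its true class.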


With these two components, a loss function and a network architecture, we again have everything we 
need to train a network. To do this, we need to write a little more code: 


import theano 

all_params = lasagne.layers.get_all_params(l2) 
# Plain stochastic gradient descent on the loss defined above 
updates = lasagne.updates.sgd(loss, all_params, learning_rate=1) 

train = theano.function([l0.input_var, true_output], loss, updates=updates) 
# net_output is the symbolic network output, needed for prediction 
net_output = lasagne.layers.get_output(l2) 
get_output = theano.function([l0.input_var], net_output) 

for n in range(100): 
    train(X, y) 


This code uses Theano functionality to train our example network iteratively, using our loss function, 
to classify a given set of input data. 
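The update that gradient descent applies on each training call is simply "parameter minus learning rate times gradient". The sketch below shows that rule on a toy least-squares problem in plain Python. The data, the learning rate of 0.05, and all function names are assumptions for illustration, and each step uses the full toy dataset instead of a minibatch.

```python
def sgd_step(params, grads, learning_rate):
    """Return updated parameters: each one moves against its gradient."""
    return [p - learning_rate * g for p, g in zip(params, grads)]


def loss_and_grads(params, data):
    """Mean squared error of y = w * x + b, with gradients for w and b."""
    w, b = params
    n = len(data)
    errors = [(w * x + b) - y for x, y in data]
    loss = sum(e ** 2 for e in errors) / n
    grad_w = sum(2 * e * x for e, (x, _) in zip(errors, data)) / n
    grad_b = sum(2 * e for e in errors) / n
    return loss, [grad_w, grad_b]


data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # generated by y = 2x + 1
params = [0.0, 0.0]

initial_loss, _ = loss_and_grads(params, data)
for _ in range(500):
    _, grads = loss_and_grads(params, data)
    params = sgd_step(params, grads, learning_rate=0.05)

final_loss, _ = loss_and_grads(params, data)
print(params, final_loss)
```

With a suitably small learning rate the parameters settle close to the slope and intercept that generated the data.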
Introduction to TensorFlow 


When we reviewed Google's take on the convolutional neural network (CNN) in Chapter 4, 
Convolutional Neural Networks, we found a convoluted, many-layered beast. The question of how to 
create and monitor such networks only becomes more important as the network scales in layer count 
and complexity to attack more complex challenges. 


To address this challenge, the Machine Intelligence research organisation at Google developed and 
distributed a library named TensorFlow, which exists to enable easier refinement and modeling of 
very involved machine learning models. 


TensorFlow does this by providing two main benefits: a clear and simple programming interface (in 
this case, a Python API) onto familiar structures (such as NumPy objects), and powerful diagnostic 
and graph visualisation tools, such as TensorBoard, to enable informed tuning of a data architecture. 


Getting to know TensorFlow 


TensorFlow enables a data scientist to design data transformation operations as a flow across a 
computation graph. This graph can be extended and modified, while individual nodes can be tuned 
extensively, enabling detailed refinements of individual layers or model components. The 
TensorFlow workflow typically involves two phases. The first of these is referred to as the 
construction phase, during which a graph is assembled. 
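The split between assembling a graph and running it can be illustrated without TensorFlow at all. The following toy sketch is not the TensorFlow API: the `Node` class and `placeholder` helper are hypothetical names used only to show that building the graph computes nothing until values are fed in.

```python
class Node:
    """A graph node: an operation plus the nodes that feed it."""

    def __init__(self, op, inputs=(), name=None):
        self.op = op
        self.inputs = inputs
        self.name = name

    def evaluate(self, feed):
        """Run the graph: recursively compute this node's value."""
        if self.op == "placeholder":
            return feed[self.name]
        values = [node.evaluate(feed) for node in self.inputs]
        return self.op(*values)


def placeholder(name):
    return Node("placeholder", name=name)


# Construction phase: nothing is computed yet, we only wire nodes together.
x = placeholder("x")
w = placeholder("w")
b = placeholder("b")
product = Node(lambda left, right: left * right, (x, w))
output = Node(lambda left, right: left + right, (product, b))

# Second phase: supply concrete values and run the graph.
result = output.evaluate({"x": 3.0, "w": 2.0, "b": 1.0})
print(result)
```

The same assembled graph can be evaluated repeatedly with different inputs, which is what makes it convenient to inspect and tune individual nodes.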


During the construction phase, we can write code using the Python API for TensorFlow. Like Lasagne, 
TensorFlow offers a relatively simple interface to writing network layers, requiring simply that we 
specify weights and bias before creating our layers. The following example shows initial setting of 
weight and bias variables, before creating (using one line of code each) a convolutional layer and a 
simple max-pooling layer. Additionally, we use tf.placeholder to generate placeholder variables 
for our input data. 


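import tensorflow as tf 

x = tf.placeholder(tf.float32, shape=[None, 784])   # flattened 28x28 images 
y_ = tf.placeholder(tf.float32, shape=[None, 10])   # one-hot class labels 
x_image = tf.reshape(x, [-1, 28, 28, 1])            # restore the image layout 
W = tf.Variable(tf.zeros([5, 5, 1, 32])) 
b = tf.Variable(tf.zeros([32])) 
h_conv = tf.nn.relu( 
    tf.nn.conv2d(x_image, W, strides=[1, 1, 1, 1], padding='SAME') + b) 
h_pool = tf.nn.max_pool( 
    h_conv, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME') 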


This structure can be extended to include a softmax output layer, just as we did with Lasagne. 


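# Flatten the pooled feature maps (14 x 14 x 32 after 2x2 pooling) for the dense layer 
h_flat = tf.reshape(h_pool, [-1, 14 * 14 * 32]) 
W_out = tf.Variable(tf.zeros([14 * 14 * 32, 10])) 
b_out = tf.Variable(tf.zeros([10])) 
y = tf.nn.softmax(tf.matmul(h_flat, W_out) + b_out) 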
Again, we can see significant improvements in the iteration time over writing directly in Theano and 
Python libraries. Being written in C++, TensorFlow also provides performance gains over Python, 
providing advantages in execution time. 


Next up, we need to train and evaluate our model. Here, we'll need to write a little code to define our 
loss function for training (cross entropy, in this case), an accuracy function for validation and an 
optimisation method (in this case, steepest gradient descent). 


# Cross entropy between the true labels y_ and the predicted probabilities y 
cross_entropy = tf.reduce_mean( 
    -tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1])) 
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy) 
# A prediction is correct when the most probable class matches the label 
correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1)) 
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)) 
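The accuracy calculation compares the index of the largest predicted probability with the index of the one in the one-hot label, then averages the matches. A plain-Python sketch of that logic follows; the helper names and the three example rows are made up for illustration.

```python
def argmax(values):
    """Index of the largest value in one row."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(predicted_rows, one_hot_rows):
    """Share of rows where the predicted class matches the one-hot label."""
    correct = [
        argmax(predicted) == argmax(label)
        for predicted, label in zip(predicted_rows, one_hot_rows)
    ]
    return sum(correct) / len(correct)   # booleans count as 0 or 1


predictions = [[0.1, 0.8, 0.1], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]
labels = [[0, 1, 0], [0, 0, 1], [0, 0, 1]]
score = accuracy(predictions, labels)
print(score)
```

Here the first and third rows are classified correctly and the second is not, so two of the three predictions count towards the score.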


Following this, we can simply begin running our model iteratively. This is all succinct and very 
straightforward: 


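from tensorflow.examples.tutorials.mnist import input_data 

mnist = input_data.read_data_sets('MNIST_data', one_hot=True) 
keep_prob = tf.placeholder(tf.float32)   # dropout keep probability 
sess = tf.InteractiveSession() 

sess.run(tf.initialize_all_variables()) 
for i in range(20000): 
    batch = mnist.train.next_batch(50) 
    if i % 100 == 0: 
        # Report accuracy on the current batch with dropout switched off 
        train_accuracy = accuracy.eval(feed_dict={ 
            x: batch[0], y_: batch[1], keep_prob: 1.0}) 
        print("step %d, training accuracy %g" % (i, train_accuracy)) 
    train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5}) 

print("test accuracy %g" % accuracy.eval(feed_dict={ 
    x: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0})) 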


Using TensorFlow to iteratively improve our models 


Even from the single example in the preceding section, we should be able to recognise what 
TensorFlow brings to the table. It offers a simple interface for the task of developing complex 
architectures and training methods, giving us easier access to the algorithms we've learnt about 
earlier in this book. 


As we know, however, developing an initial model is only a small part of the model development 
process. We usually need to test and dissect our models repeatedly to improve their performance. 
However, this tends to be an area where our tools are less unified in a single library or technique, and 
the tests and monitoring solutions less consistent across models. 


TensorFlow looks to solve the problem of how to get good insight into our models during iteration, in 
what it calls the execution phase of model development. During the execution phase, we can make use 
of tools provided by the TensorFlow team to explore and improve our models. 


Perhaps the most important of these tools is TensorBoard, which provides an explorable, visual 
representation of the model we've built. TensorBoard provides several capabilities, including 
dashboards that show basic model information (including performance measurements during 
each iteration for test and/or training). 


In addition, TensorBoard dashboards provide lower-level information, including plots of the range of 
values for weights, biases and activation values at every model layer; this is tremendously useful 
diagnostic information during iteration. The process of accessing this data is hassle-free and it is 
immediately useful. 
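The kind of per-layer summary such a dashboard plots can be approximated by hand. The sketch below is plain Python and does not use TensorBoard; the layer names and values are invented for illustration.

```python
import statistics


def summarize(values):
    """Range and spread of a flat list of parameter values."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.pstdev(values),
    }


# Hypothetical snapshots of flattened parameters, keyed by layer name.
layers = {
    "conv1/weights": [-0.12, 0.05, 0.30, -0.27, 0.08],
    "conv1/biases": [0.1, 0.1, 0.1, 0.1, 0.1],
    "softmax/weights": [0.0, 0.0, 0.0, 0.0, 0.0],
}

summaries = {name: summarize(values) for name, values in layers.items()}
for name, stats in summaries.items():
    print(name, stats)
```

A layer whose weights show no spread at all, as in the last entry, is the sort of anomaly that stands out immediately when these ranges are plotted over training iterations.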
Further to this, TensorBoard provides a detailed graph of the tensor flow for a given model. A 
tensor is an n-dimensional array of data (in this case, of n-many features); it's what we tend to think of 
when we use the term "the input dataset". The series of operations that is applied to a tensor is 
described as the tensor flow and in TensorFlow it's a fundamental concept, for a simple and 
compelling reason. When refining and debugging a machine learning model, what matters is having 
information about the model and its operations at even a low level. 


[Figure: TensorBoard graph view showing the cross_entropy, train, and accuracy nodes] 
TensorBoard graphs show the structure of a model in variable detail. From this initial view, it is 
possible to dig into each component of the model and into successive sub-elements. In this case, we 
are able to view the specific operations that take place within the dropout function of our second 
network layer. We can see what happens and identify what to tweak for our next iteration. 
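As a reminder of what a dropout function does internally, here is a minimal plain-Python sketch of inverted dropout. It is an illustration, not the TensorFlow implementation; the function signature and the fixed random seed are assumptions.

```python
import random


def dropout(activations, keep_prob, rng):
    """Inverted dropout: zero each unit with probability 1 - keep_prob and
    scale the survivors by 1 / keep_prob so the expected value is unchanged."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    return [
        value / keep_prob if rng.random() < keep_prob else 0.0
        for value in activations
    ]


rng = random.Random(42)
activations = [1.0] * 10000

training_output = dropout(activations, keep_prob=0.5, rng=rng)    # training setting
evaluation_output = dropout(activations, keep_prob=1.0, rng=rng)  # evaluation setting

kept_fraction = sum(1 for value in training_output if value != 0.0) / len(activations)
print(kept_fraction)
```

With a keep probability of 1.0 the activations pass through unchanged, which is why evaluation runs feed that value while training runs feed 0.5.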


This level of transparency is unusual and can be very helpful when we want to tweak model 
components, especially when a model element or layer is underperforming (as we might see, for 
instance, from TensorBoard graphs showing layer metaparameter values or from network 
performance as a whole). 


TensorBoard visualisations are created from event logs, which are generated when TensorFlow is run. 
This makes the benefits of TensorBoard easy to obtain during the course of everyday development 
using TensorFlow. 
As of late April 2016, the DeepMind team joined the Google Brain team and a broad set of other 
researchers and developers in using TensorFlow. By making TensorFlow open source and freely 
available, Google is committing to continue supporting TensorFlow as a powerful tool for model 
development and refinement. 
Knowing when to use these libraries 


At one or two points in this chapter, we probably ran into the question of "Okay, so, why didn't you 
just teach us about this library to begin with?" It's fair to ask why we spent time digging around in 
Theano functions and other low-level information when this chapter presents perfectly good 
interfaces that make life easier. 


Naturally, I advocate using the best tools available, especially for prototyping tasks where the value 
of the work is more in understanding the general ballpark you're in, or in identifying specific problem 
classes. It's worth recognising the three reasons for not presenting content earlier in this book using 
either of these libraries. 


The first reason is that these tools will only get you so far. They can do a lot, agreed, so depending on 
the domain and the nature of that domain's problems, some data scientists may be able to rely on them 
for the majority of deep learning needs. Beyond a certain level of performance and problem 
complexity, of course, you need to understand what is needed to construct a model in Theano, create 
your own scoring function from scratch or leverage the other techniques described in this book. 


Another part of the decision to focus on teaching lower-level implementation is about the developing 
maturity of the technologies involved. At this point, Lasagne and TensorFlow are definitely worth 
discussing and recommending to you. Prior to this, when the majority of the book was written, the risk 
around discussing the libraries in this chapter was greater. There are many projects based on Theano 
(some of the more prominent frameworks which weren't discussed in this chapter are Keras, Blocks 
and Pylearn2). 


Even now, it's entirely possible that different libraries and tools will be the subject of discussion or 
the default working environment in a year or two years' time. This field moves extremely fast, largely 
due to the influence of key companies and research groups who have to keep building new tools as the 
old ones reach their useful limits... or it just becomes clear how to do things better. 


The other reason to dig in at a lower level, honestly, is that this is an involved book. It sets theory 
alongside code and uses the code to teach the theory. Abstracting away how the algorithms work and 
simply discussing how to apply them to crack a particular example can be tempting. The tools 
discussed in this chapter enable practitioners to get very good scores on some problems without ever 
understanding the functions that are being called. My opinion is that this is not a very good way to 
train a data scientist. 


If you're going to operate on subtle and difficult data problems, you need to be able to modify and 
define your own algorithm. You need to understand how to choose an appropriate solution. To do 
these things, you need the details provided in this book and even more very specific information that I 
haven't provided, due to the limitations of (page) space and time. At that point, you can apply deep 
learning algorithms flexibly and knowledgeably. 


Similarly, it's important to recognise what these tools do well or less well. At present, Lasagne fits 
very well within that use-case where a new model is being developed for benchmarking or early 
passes, where the priority should be on iteration speed and getting results. 


TensorFlow, meanwhile, fits later into the development lifespan of a model. When the easy gains 
disappear and it's necessary to spend a lot of time debugging and improving a model, the relatively 
quick iterations of TensorFlow are a definite plus, but it's the diagnostic tools provided by 
TensorBoard that present an overwhelming value-add. 


There is, therefore, a place for both libraries in your toolset. Depending on the nature of the problem 
at hand, these libraries and more will prove to be valuable assets. 
Further reading 


The Lasagne User Guide is thorough and worth reading. Find it at 
http://lasagne.readthedocs.io/en/latest/index.html. 


Similarly, find the TensorFlow tutorials on the TensorFlow website. 
Summary 


In this final chapter, we moved some distance from our previous discussions of algorithms, 
configuration and diagnosis to consider tools that improve our experience when implementing deep 
learning algorithms. 


We discovered the advantages to using Lasagne, an interface to Theano designed to accelerate and 
simplify early prototyping of our models. Meanwhile, we examined TensorFlow, the library 
developed by Google to aid Deep Learning model adjustment and optimization. TensorFlow offers us 
a remarkable amount of visibility of model performance, at minimal effort, and makes the task of 
diagnosing and debugging a complex, deep model structure much less challenging. 


Both tools have their own place in our processes, with each being appropriate for a particular set of 
problems. 


Over the course of this book as a whole, we have walked through and reviewed a broad set of 
advanced machine learning techniques. We went from a position where we understood some 
fundamental algorithms and concepts, to having confident use of a very current, powerful and sought- 
after toolset. 


Beyond the techniques, though, this book attempts to teach one further concept, one that's much harder 
to teach and to learn, but which underpins the best performance in machine learning. 


The field of machine learning is moving very fast. This pace is visible in new and improved scores 
that are posted almost every week in academic journals or industry white papers. It's visible in how 
training examples like MNIST have moved quickly from being seen as meaningful challenges to being 
toy problems, the deep learning version of the Iris dataset. Meanwhile, the field moves on to the next 
big challenge; CIFAR-10, CIFAR-100. 


At the same time, the field moves cyclically. Concepts introduced by academics like Yann LeCun in 
the 1980s are in resurgence as computing architectures and resource growth make their use more viable 
over real data at scale. To use many of the most current techniques at their best limits, it's necessary to 
understand concepts that were defined decades ago, themselves defined on the back of other concepts 
defined still longer ago. 


This book tries to balance these concerns. Understanding the cutting edge and the techniques that exist 
there is critical; understanding the concepts that'll define the new techniques or adjustments made in 
two or three years' time is equally important. 


Most important of all, however, is that this book gives you an appreciation of how malleable these 
architectures and approaches can be. A concept consistently seen at the top end of data science 
practice is that the best solution to a specific problem is a problem-specific solution. 


This is why top Kaggle contest winners perform extensive feature preparation and tweak their 
architectures. It's why TensorFlow was written to allow clear vision of granular properties of one's 
architectures. Having the knowledge and the skills to tweak implementations or combine algorithms 
fluently is what it takes to have true mastery of machine learning techniques. 


Through the many techniques and examples reviewed within this book, it is my hope that the ways of 
thinking about data problems and a confidence in manipulating and configuring these algorithms have 
been passed on to you as a practicing data scientist. The many recommended Further reading 
examples in this book are largely intended to further extend that knowledge and help you develop the 
skills taught in this book. 


Beyond that, I wish you all the best of luck in your model building and configuration. I hope that you 
learn for yourself just how enjoyable and rewarding this field can be! 
Chapter 10. Chapter Code Requirements 


This book's content leverages openly available data and code, including open source Python libraries 
and frameworks. While each chapter's example code is accompanied by a README file documenting 
all the libraries required to run the code provided in that chapter's accompanying scripts, the content 
of these files is collated here for your convenience. 

It is recommended that you already have the libraries required for the earlier chapters installed when 
working with code from any later chapter. These requirements are identified using keywords. It is 
particularly important to set up the libraries mentioned in Chapter 1, Unsupervised Machine 
Learning, for any content provided later in the book. The requirements for every chapter are given in 
the following table: 


Chapter    Requirements 

1          • Python 3 (3.4 recommended) 
           • sklearn (NumPy, SciPy) 
           • matplotlib 

           • Natural Language Toolkit (NLTK) 
           • BeautifulSoup 
           • Twitter API account 

9          • Lasagne 
           • TensorFlow 
Appendix A. Bibliography 


This course is packaged keeping your journey in mind. It includes content from the following Packt 
products: 


• Python Machine Learning, Sebastian Raschka 
• Designing Machine Learning Systems with Python, David Julian 
• Advanced Machine Learning with Python, John Hearty 
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